Re: [Tracker] Making some of our team-planning public



This is a new idea/item that came up today:

o. Some item_remove performance improvements:

   Right now, for each delete, two queries are executed:
     o. Set tracker:available to false       - (a)
     o. Delete the actual resource           - (b)
                                                                 
      Both are put in the process pool with the same priority (see
      tracker_processing_pool_push_ready_task in item_remove). This
      means that the queue looks like this:

      ababababababababababababab

      Query (a) is only done because (b) is slow, but with the queue
      interleaved like this it doesn't help much. The queue should look
      like this instead:

      aaaaaaaaaaaaabbbbbbbbbbbbb
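
      One way to get that grouping (a minimal sketch with GLib, not the
      actual processing-pool code; the Task struct and push_task() are
      made up for this example) is to insert tasks into the queue
      sorted by task type, so every (a) stays ahead of every (b):

      #include <glib.h>

      /* Sketch only: 'a' = set tracker:available to false,
       * 'b' = delete the actual resource.  Sorting by task type keeps
       * the queue in the aaa...bbb shape described above. */
      typedef struct {
              gchar  task_type;   /* 'a' or 'b' */
              gchar *sparql;      /* the update to execute */
      } Task;

      static gint
      task_compare (gconstpointer p1, gconstpointer p2, gpointer user_data)
      {
              const Task *t1 = p1;
              const Task *t2 = p2;

              return t1->task_type - t2->task_type;   /* 'a' sorts first */
      }

      static void
      push_task (GQueue *queue, gchar task_type, gchar *sparql)
      {
              Task *task = g_new0 (Task, 1);

              task->task_type = task_type;
              task->sparql = sparql;
              g_queue_insert_sorted (queue, task, task_compare, NULL);
      }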

o. The process pool must queue actions, not textual strings of queries.
   Why? Because queries can often be merged together. Why is that useful?

   o. For query (a): Instead of doing DELETE { ?c tracker:available
      true } WHERE { ?c nie:url ?u . FILTER (tracker:uri-is-descendant
      ("%s", ?u)) } per file, we could let tracker:uri-is-descendant
      work similarly to SQLite's/SPARQL's IN and let it accept an
      array. This would avoid a full table scan for each and every file
      being deleted and replace it with a single full table scan for a
      single query.

   o. We can do the exact same thing, but with SQLite's/SPARQL's IN,
      for query (b). A rough sketch of both merged queries follows at
      the end of this item.
    
        So that rewrites the queue to:

       a{for 13 items}b{for 13 items}         

       Or sometimes something like (depending on the timings of
       queue-item-execs):
       
       a{for 3 items}a{for 5 items}a{for 5 items}b{for 5 items}b{for 8 items}

       Which is still expected to be better than either 

       aaaaaaaaaaaaabbbbbbbbbbbbb    or
       ababababababababababababab
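
   A rough sketch of what the two merged updates could look like
   (assuming tracker:uri-is-descendant learns to accept a list of URIs
   and that IN is usable in the FILTER; the URIs and the exact syntax
   below are made up for illustration):

   /* Illustrative only: neither the list form of
    * tracker:uri-is-descendant nor these URIs are real. */
   static const gchar *merged_query_a =
           "DELETE { ?f tracker:available true } "
           "WHERE { ?f nie:url ?u . "
           "        FILTER (tracker:uri-is-descendant ("
           "                (\"file:///dir1\", \"file:///dir2\"), ?u)) }";

   static const gchar *merged_query_b =
           "DELETE { ?f a rdfs:Resource } "
           "WHERE { ?f nie:url ?u . "
           "        FILTER (?u IN (\"file:///dir1/1.mp3\", "
           "                       \"file:///dir1/2.mp3\", "
           "                       \"file:///dir2/3.mp3\")) }";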


On Tue, 2011-02-22 at 18:50 +0100, Philip Van Hoof wrote:
Hi there,

I compiled a public list of things that we are planning to do. These
items have different authors (I distilled and cleaned it up from an
etherpad).

This list can help other teams, companies and interested developers in
understanding what our priorities and short-term future plans are.

It can also be helpful for contributors to know what to focus on in
case they want to help us get something done at a higher priority. We
define our own prioritization, so there is no need to try to convince
us to reprioritize unless you also start sending us patches.

ps.   Sending those patches is, though, an insanely, unimaginably good
      technique to get us to listen to you. You'll notice our attention
      shift to you almost instantly. Only because we can't shift our
      attention faster than the speed of light won't we do it even
      before you start writing the patch. In a different universe
      perhaps this would be possible. We lack experience :(

      So start now. And in this universe.

You can get in touch with the team publicly in the channel #tracker on
GIMPNet's IRC servers. Key people are martyn, aleksander, juergbi,
pvanhoof, frade, JL, ottela, abustany, garnacho, rdale, marja (the last
two are qsparql developers and abustany is a qcontacts-tracker
developer -- they aren't Tracker-project maintainers but they are often
involved in decisions).

Check this mailing list for E-mail addresses of these people.

The list changes rapidly. Just pointing that out as a reminder (the
list isn't a promise whatsoever, plus our priorities are often "agile").

Bla bla here is the list:

o. Documentation on live.gnome.org: examples are often still using
   libtracker-client. Most of the documentation on live.gnome.org talks
   about either D-Bus or libtracker-client, and not libtracker-sparql.
   We should update all these documentation bits and pieces.

o. Discussion about what the behaviour of triples outside of the UNION
   block should be
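
   As an illustration of the ambiguity (the query below is made up for
   this example): should the triple pattern outside the UNION block
   constrain both branches, and what happens to solutions where it does
   not match?

   /* Made-up example: ?f nie:url ?u sits outside the UNION block. */
   static const gchar *example_query =
           "SELECT ?f ?u ?title "
           "WHERE { ?f nie:url ?u . "
           "        { ?f a nmm:MusicPiece ; nie:title ?title } "
           "        UNION "
           "        { ?f a nmm:Video ; nie:title ?title } }";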

o. Investigate whether we can use GVariant more often/more efficiently
   for the class signals and writeback signal emit features
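
   For example (a sketch only; this payload layout is an assumption,
   not what tracker-store emits today), the subject/predicate/object
   triples of a class signal could be packed into a single a(sss)
   GVariant:

   #include <glib.h>

   /* Sketch: pack (subject, predicate, object) triples into one
    * GVariant of type a(sss). */
   static GVariant *
   build_triples_variant (const gchar *triples[][3], guint n_triples)
   {
           GVariantBuilder builder;
           guint i;

           g_variant_builder_init (&builder, G_VARIANT_TYPE ("a(sss)"));

           for (i = 0; i < n_triples; i++) {
                   g_variant_builder_add (&builder, "(sss)",
                                          triples[i][0],
                                          triples[i][1],
                                          triples[i][2]);
           }

           return g_variant_builder_end (&builder);
   }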

o. During review I saw a few things in tracker-store that could be done
   in a better way in Vala but were ported 1:1 from C. We should
   perhaps try to simplify that code

o. While investigating I noticed that with a cp of 40 files to a folder
   (no subfolders), only 9 insert queries get batched together using
   UpdateArray, instead of all 40. This leaves us with more of the
   overhead that UpdateArray is meant to reduce. Investigate why the
   UpdateArray query is already fired at 9 files and why it doesn't
   batch all 40.

        It seems item_queue_get_next_file() returns QUEUE_NONE/QUEUE_WAIT
        and in this case we always flush the UpdateArray buffers. Now,
        why is QUEUE_NONE/QUEUE_WAIT returned several times? It can
        only happen when the miner-fs cannot currently start processing
        new items, either because there are none (maybe there are, but
        they need to wait for others to get pushed to the store), or
        because it reached the max number of wait items (the max number
        of items that can be sent in parallel to tracker-extract,
        currently 10). Worth investigating this specific case.

        The cp of 40 files generates 40 CREATED events that are treated
        separately, and item_queue_handlers_setup() is called after
        each one is received; this makes it easier to get into
        QUEUE_NONE/QUEUE_WAIT situations. UpdateArray was initially
        meant to be used during crawling, not when processing events.
        Anyway, it could be tweaked to also merge several updates while
        processing events, but only with a very low time threshold for
        merging events, not the current 15s. Not sure how useful that
        is, considering how new files are copied to the device (one by
        one, as a slow copy over USB).
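
        A rough sketch of what such a low merge threshold could look
        like (the 100 ms value, the buffer and flush_buffer() are
        assumptions for this example, not the current miner-fs code):

        #include <glib.h>

        #define FLUSH_LIMIT      40     /* assumed batch size */
        #define FLUSH_TIMEOUT_MS 100    /* assumed low threshold */

        static GPtrArray *buffer = NULL;
        static guint flush_id = 0;

        static void
        flush_buffer (void)
        {
                if (buffer == NULL || buffer->len == 0)
                        return;
                /* ... send the buffered updates as one UpdateArray ... */
                g_ptr_array_set_size (buffer, 0);
        }

        static gboolean
        flush_timeout_cb (gpointer user_data)
        {
                flush_id = 0;
                flush_buffer ();
                return FALSE;   /* one-shot timeout */
        }

        /* Buffer an update coming from a CREATED event; flush when the
         * buffer is full or shortly after the last event arrives. */
        static void
        queue_update (gchar *sparql)
        {
                if (buffer == NULL)
                        buffer = g_ptr_array_new_with_free_func (g_free);

                g_ptr_array_add (buffer, sparql);

                if (flush_id != 0) {
                        g_source_remove (flush_id);
                        flush_id = 0;
                }

                if (buffer->len >= FLUSH_LIMIT)
                        flush_buffer ();
                else
                        flush_id = g_timeout_add (FLUSH_TIMEOUT_MS,
                                                  flush_timeout_cb, NULL);
        }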

o. The UpdateArray technique is not applied to the communication between
   miner-fs and tracker-extract. The exact same reason why we did
   UpdateArray also applies to the IPC communication between miner-fs
   and tracker-extract. We can implement this technique for that IPC
   too, measure the difference, etc.

        UpdateArray currently makes the store receive several requests
        together in the same D-Bus request (so it reduces D-Bus
        overhead), but it then inserts the updates one by one to get
        per-update errors, if any. We could avoid this and try to
        insert all the updates in the same run, which would give us
        some more performance improvement. If that merged insert fails,
        we could then retry one by one to keep reporting per-request
        errors; a rough sketch follows below. See

        https://projects.maemo.org/mailman/pipermail/tracker-maintainers/2010-September/000128.html
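
        A rough sketch of that retry idea (execute_update() and
        report_error() are hypothetical helpers standing in for
        whatever the store uses to run one update and report one
        error):

        #include <glib.h>

        /* Hypothetical helpers, not real tracker-store functions. */
        static void execute_update (const gchar *sparql, GError **error);
        static void report_error   (guint index, const GError *error);

        /* Sketch only: 'updates' is a NULL-terminated array holding
         * n_updates SPARQL strings from one UpdateArray request. */
        static void
        run_update_array (gchar **updates, guint n_updates)
        {
                GError *error = NULL;
                gchar *merged;
                guint i;

                /* Fast path: try everything as a single update. */
                merged = g_strjoinv (" ", updates);
                execute_update (merged, &error);
                g_free (merged);

                if (error == NULL)
                        return;

                g_clear_error (&error);

                /* Slow path: retry one by one so we can still report
                 * per-request errors. */
                for (i = 0; i < n_updates; i++) {
                        execute_update (updates[i], &error);
                        if (error != NULL) {
                                report_error (i, error);
                                g_clear_error (&error);
                        }
                }
        }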

o. We could allow multiple tracker-extract processes to run
   simultaneously.

      o. An extractor process per folder, with a max amount of
         processes of course (say 4)

      o. An extractor process per file-type or group of file-types

      o. It would be good to know anyway which is the best possible
         time we could ever achieve with the current setup, e.g. by
         computing the sum of all independent extraction times of all
         files and checking what percentage of the total indexing time
         that is. If that percentage is very high, it is worth
         investigating several tracker-extracts in parallel; otherwise,
         check where the bottleneck is.
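
         For instance (numbers made up): if the per-file extraction
         times sum to 45 s out of a 60 s indexing run, extraction is
         75% of the total and parallel tracker-extracts look worth
         trying.  A trivial helper for that share:

         #include <glib.h>

         /* Sketch: percentage of total indexing time spent extracting,
          * given the sum of all independent per-file extraction times. */
         static gdouble
         extraction_share (gdouble sum_extraction_time,
                           gdouble total_indexing_time)
         {
                 if (total_indexing_time <= 0)
                         return 0.0;

                 return 100.0 * sum_extraction_time / total_indexing_time;
         }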

o. Investigate a change in the flow so that tracker-miner-fs doesn't
   merge the SPARQL received from tracker-extract. This reduces not
   only some D-Bus overhead, but also the need to play with huge chunks
   of heap memory in the miner-fs to construct the SPARQL update query:

      o. tracker-miner-fs extracts basic file metadata and sends it to
         the store
      o. When the reply from the store is received for a given file,
         request the extractor to extract data from the file, passing
         the URN of the resource
      o. Let tracker-extract insert the extracted data directly into
         the store.
      o. This will also allow having files inserted in the store even
         if extraction of their contents failed (of course resource
         types would be missing, apart from the basic nfo ones)
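
   The step ordering, with entirely hypothetical function names (this
   is only a sketch of the proposed flow, not existing code):

   #include <gio/gio.h>

   /* Hypothetical helpers standing in for the real miner-fs/IPC calls. */
   static const gchar *insert_basic_metadata (GFile *file);
   static void         request_extraction    (GFile *file, const gchar *urn);

   static void
   process_file (GFile *file)
   {
           const gchar *urn;

           /* 1. miner-fs inserts only the basic file metadata itself
            *    and gets the resource URN back from the store. */
           urn = insert_basic_metadata (file);

           /* 2. It then asks tracker-extract to process the file,
            *    passing the URN; tracker-extract inserts the extracted
            *    data into the store directly, so miner-fs never has to
            *    merge or buffer that SPARQL. */
           request_extraction (file, urn);
   }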

o. Investigate whether an extractor module can be improved for
   performance (e.g. by replacing a library like poppler :p, because
   it's so damn slow). Idem for other formats.

o. Delete artists alongside deleting a song resource (a workaround for
   not having reference counting and orphan deletion - yet).
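
   A rough sketch of what such a workaround query could look like (the
   song URN is made up, and the exact SPARQL would need checking
   against the current query engine):

   /* Sketch only: delete the artist of the song being removed, but
    * only if no other music piece still references that artist.
    * <urn:song:123> is a made-up URN. */
   static const gchar *delete_artist_query =
           "DELETE { ?artist a rdfs:Resource } "
           "WHERE { <urn:song:123> nmm:performer ?artist . "
           "        OPTIONAL { ?other nmm:performer ?artist . "
           "                   FILTER (?other != <urn:song:123>) } "
           "        FILTER (!bound(?other)) }";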

o. Do orphan deletion (major item in future roadmap already).

o. Refactor tracker-miner-fs and libtracker-miner, splitting tracker-
   miner-fs.c into different submodules and files, as was done with the
   processing pool.

o. Limit the maximum number of requests sent to the store. This is not
   being done with the refactored processing pool.

o. Try to document the execution flow of the miner-fs. 

o. Manage the list of ignored files/directories/patterns directly at
   tracker-monitor level in libtracker-miner:

      o. Events can properly be merged (e.g. a rename of a non-ignored
         filename to an ignored filename should be notified to upper
         layers as a DELETE and not as a MOVE); a rough sketch of this
         follows after these items.

      o. Improves management of the actual GFileMonitors (e.g. a rename
         of a non-ignored directory name to an ignored directory name
         should trigger removal of the GFileMonitor in the directory;
         currently this is not done at TrackerMonitor level as it
         doesn't know about ignored directories).
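
      A rough sketch of the event-merging idea (is_ignored() is a
      hypothetical helper for whatever ignore-list check ends up living
      at TrackerMonitor level):

      #include <gio/gio.h>

      /* Hypothetical check against the ignored files/directories/
       * patterns that TrackerMonitor would manage. */
      static gboolean is_ignored (GFile *file);

      /* Sketch only: a MOVED event whose destination is ignored should
       * reach upper layers as a DELETE of the source, not as a MOVE. */
      static GFileMonitorEvent
      merge_event (GFileMonitorEvent event,
                   GFile            *file,
                   GFile            *other_file)
      {
              if (event == G_FILE_MONITOR_EVENT_MOVED &&
                  other_file != NULL &&
                  is_ignored (other_file)) {
                      return G_FILE_MONITOR_EVENT_DELETED;
              }

              return event;
      }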

o. Refactor tracker-writeback to work more like tracker-extract does:
      o. Move tracker-writeback's "Writeback" listening to miner-fs
      o. Simplify tracker-writeback itself
      o. Remove IgnoreNextUpdate in miner-fs; what IgnoreNextUpdate
         does now can be wrapped around the D-Bus call to tracker-
         writeback's service.

o. Add a signal when Restore() is finished, to let clients know that
   they should restart and/or invalidate their locally cached RDF data


Cheers,

Philip


-- 


Philip Van Hoof
freelance software developer
Codeminded BVBA - http://codeminded.be



