[Tracker] Making some of our team-planning public



Hi there,

I compiled a public list of things that we are planning to do. These
items have different authors (I distilled and cleaned it up from an
etherpad).

This list can help other teams, companies and interested developers in
understanding what our priorities and short-term future plans are.

It can also be helpful for contributors to know what to focus on in case
they want to help us get something done at a higher priority. We define
our own prioritization, so there's no need to try to convince us to
reprioritize unless you also start sending us patches.

ps.     Sending those patches is an insanely, extremely, unimaginably
        good technique to get us to listen to you, though. You'll notice
        our attention shift to you almost instantly. It's only because
        we can't shift our attention faster than the speed of light that
        we won't do it even before you start writing the patch. In a
        different universe perhaps this would be possible. We lack
        experience :(

        So start now. And in this universe.

You can get in touch with the team publicly at the channel #tracker on
GimpNET's IRC servers. Key people are martyn, aleksander, juergbi,
pvanhoof, frade, JL, ottela, abustany, garnacho, rdale, marja (last two
are qsparql developers, abustany is a qcontacts-tracker developer --
they aren't Tracker-project maintainers but they often are involved in
decisions).

Check this mailing list for E-mail addresses of these people.

The list changes rapidly. Just pointing that out as a reminder (the list
isn't a promise whatsoever, plus our priorities are often "agile").

Bla bla here is the list:

o. On live.gnome.org (examples are often still using libtracker-client).
   Most of the documentation on live.gnome.org talks about either DBus
   or libtracker-client, and not libtracker-sparql. We should update all
   of these documentation bits and pieces.
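
   As a rough illustration of what the updated documentation could show,
   here is a minimal query sketch using libtracker-sparql instead of
   libtracker-client (build details such as the pkg-config package name
   depend on the Tracker version you target):

/* Minimal libtracker-sparql query example (sketch). */
#include <glib.h>
#include <libtracker-sparql/tracker-sparql.h>

int
main (void)
{
        GError *error = NULL;
        TrackerSparqlConnection *connection;
        TrackerSparqlCursor *cursor;

        g_type_init ();  /* only needed with GLib < 2.36 */

        connection = tracker_sparql_connection_get (NULL, &error);
        if (!connection) {
                g_printerr ("Couldn't obtain a connection: %s\n", error->message);
                g_error_free (error);
                return 1;
        }

        cursor = tracker_sparql_connection_query (connection,
                                                  "SELECT ?url WHERE { "
                                                  "?f a nfo:FileDataObject ; "
                                                  "nie:url ?url } LIMIT 5",
                                                  NULL, &error);

        if (!cursor) {
                g_printerr ("Query failed: %s\n", error->message);
                g_error_free (error);
        } else {
                while (tracker_sparql_cursor_next (cursor, NULL, NULL))
                        g_print ("%s\n", tracker_sparql_cursor_get_string (cursor, 0, NULL));
                g_object_unref (cursor);
        }

        g_object_unref (connection);
        return 0;
}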

o. Discussion about what the behaviour of triples outside of the UNION
   block should be

o. Investigate whether we can use GVariant more often/more efficiently
   for the class signals and writeback signal emit features
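
   Just to make the GVariant idea concrete, here is an illustrative
   sketch of packing a batch of (subject, predicate, object) strings into
   a single GVariant; the actual payload format of the class signals is
   not decided here:

/* Illustrative only: batching triples into one GVariant. The real
 * class-signal payload used by tracker-store may look different. */
#include <glib.h>

static GVariant *
build_signal_payload (void)
{
        GVariantBuilder builder;

        g_variant_builder_init (&builder, G_VARIANT_TYPE ("a(sss)"));
        g_variant_builder_add (&builder, "(sss)",
                               "urn:example:song-1", "nie:title", "Some title");
        g_variant_builder_add (&builder, "(sss)",
                               "urn:example:song-2", "nie:title", "Another title");

        return g_variant_ref_sink (g_variant_builder_end (&builder));
}

int
main (void)
{
        GVariant *payload;
        gchar *printed;

        payload = build_signal_payload ();
        printed = g_variant_print (payload, TRUE);

        g_print ("%s\n", printed);

        g_free (printed);
        g_variant_unref (payload);
        return 0;
}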

o. During review I saw a few things in tracker-store that could be done
   in a better way in Vala but that were ported 1:1 from C. We should
   perhaps try to simplify that code

o. While investigating I noticed that for a cp of 40 files to a folder
   (no subfolders), only 9 insert queries get batched together using
   UpdateArray, instead of 40. This leaves us with more of the overhead
   that UpdateArray is meant to reduce. Investigate why the UpdateArray
   query is fired already at 9 files instead of batching all 40 (see the
   sketch after the notes below).

          Seems item_queue_get_next_file() returns QUEUE_NONE/QUEUE_WAIT
          and in this case we always flush the UpdateArray buffers. Now,
          why is QUEUE_NONE/QUEUE_WAIT returned several times? It can
          only happen when the miner-fs cannot currently start processing
          new items, either because there are none (maybe there are, but
          they need to wait for others to get pushed to the store), or
          because it reached the max number of wait items (the max number
          of items that can be sent in parallel to tracker-extract,
          currently 10). Worth investigating this specific case.

          The cp of 40 files generates 40 CREATED events treated
          separately, and item_queue_handlers_setup() is called after
          each one is received; this makes it easier to get into
          QUEUE_NONE/QUEUE_WAIT situations. UpdateArray was initially
          meant to be used during crawling, not when processing events.
          Anyway, it could be tweaked to also merge several updates
          while processing events, but only with a very low time
          threshold for merging events, not the current 15s. Not sure if
          that's useful, considering how new files are copied to the
          device (one by one, slow copy over USB).
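
   A rough sketch (not the actual miner-fs code) of that buffering idea:
   keep per-file INSERTs in a buffer and only flush when it is full or
   when a short timeout expires, instead of flushing every time the
   queue momentarily runs dry. flush_update_array() stands in for the
   real UpdateArray request:

#include <glib.h>

#define UPDATE_ARRAY_MAX 40

static GPtrArray *pending = NULL;
static guint flush_id = 0;

static void
flush_update_array (void)
{
        if (pending == NULL || pending->len == 0)
                return;

        /* the real code would send all buffered queries to the store in
         * one UpdateArray dbus request here */
        g_print ("flushing %u updates in one request\n", pending->len);
        g_ptr_array_free (pending, TRUE);
        pending = NULL;
}

static gboolean
flush_timeout_cb (gpointer user_data)
{
        flush_id = 0;
        flush_update_array ();
        return FALSE;
}

static void
queue_update (const gchar *sparql)
{
        if (pending == NULL)
                pending = g_ptr_array_new_with_free_func (g_free);

        g_ptr_array_add (pending, g_strdup (sparql));

        if (pending->len >= UPDATE_ARRAY_MAX) {
                if (flush_id != 0) {
                        g_source_remove (flush_id);
                        flush_id = 0;
                }
                flush_update_array ();
        } else if (flush_id == 0) {
                /* a time threshold much lower than the current 15s */
                flush_id = g_timeout_add_seconds (1, flush_timeout_cb, NULL);
        }
}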

o. The UpdateArray technique is not applied to the communication between
   miner-fs and tracker-extract. The exact same reason why we did
   UpdateArray also applies to the IPC communication between miner-fs
   and tracker-extract. We can implement this technique for that IPC
   too, measure the difference, etc.

          UpdateArray currently makes the store receive several requests
          together in the same dbus request (so reduces dbus overhead)
          but then inserts updates one by one to get per-update errors
          if any. We could avoid this and try to insert all the updates
          in the same run, and that would give us some more performance
          improvement. If that merged insert fails, then we could retry
          one-by-one to keep reporting per-request errors. See
        
          https://projects.maemo.org/mailman/pipermail/tracker-maintainers/2010-September/000128.html
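
   Something along these lines, sketched with the libtracker-sparql
   update call rather than the store-internal code paths, and assuming
   'queries' is a NULL-terminated array:

#include <glib.h>
#include <libtracker-sparql/tracker-sparql.h>

static void
update_batch (TrackerSparqlConnection  *connection,
              gchar                   **queries)
{
        GError *error = NULL;
        gchar *merged;
        guint i;

        /* first attempt: run all the updates in one go */
        merged = g_strjoinv (" ", queries);
        tracker_sparql_connection_update (connection, merged,
                                          G_PRIORITY_DEFAULT, NULL, &error);
        g_free (merged);

        if (error == NULL)
                return;

        g_clear_error (&error);

        /* the merged run failed: retry one by one so that per-request
         * errors can still be reported, as UpdateArray does today */
        for (i = 0; queries[i] != NULL; i++) {
                tracker_sparql_connection_update (connection, queries[i],
                                                  G_PRIORITY_DEFAULT, NULL, &error);
                if (error != NULL) {
                        g_printerr ("update %u failed: %s\n", i, error->message);
                        g_clear_error (&error);
                }
        }
}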

o. We could allow multiple tracker-extract processes to run
   simultaneously.

        o. An extractor process per folder, with a max number of
           processes of course (say 4)

        o. An extractor process per file-type or group of file-types

        o. It would be good to know anyway what the best possible time
           we could ever achieve with the current setup is, e.g. by
           computing the sum of all independent extraction times of all
           files and checking what percentage of the total indexing
           time that is. If that percentage is very high, it's worth
           investigating several tracker-extracts in parallel;
           otherwise, check where the bottleneck is.
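
   A toy computation of that metric, with made-up numbers, just to show
   what is meant:

#include <glib.h>

int
main (void)
{
        /* made-up per-file extraction times, in seconds */
        gdouble extraction_times[] = { 0.8, 1.2, 2.5, 0.3, 4.1 };
        gdouble total_indexing_time = 12.0;  /* measured wall-clock time */
        gdouble sum = 0.0;
        guint i;

        for (i = 0; i < G_N_ELEMENTS (extraction_times); i++)
                sum += extraction_times[i];

        /* a very high percentage would justify several tracker-extract
         * processes in parallel; a low one means the bottleneck is
         * elsewhere */
        g_print ("extraction is %.1f%% of the total indexing time\n",
                 100.0 * sum / total_indexing_time);

        return 0;
}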

o. Investigate a change in the flow so that tracker-miner-fs doesn't
   merge the SPARQL received from tracker-extract. This reduces not only
   some dbus overhead, but also the need to play with huge chunks of
   heap memory in the miner-fs to construct the sparql update query
   (a sketch of the extractor side follows after this list):

        o. tracker-miner-fs extracts basic file metadata, sends it to
           the store
        o. When the reply from the store is received for a given file,
           request the extractor to extract data from the file, passing
           the URN of the resource
        o. Let tracker-extract insert the extracted data directly in the
           store.
        o. This will also allow having files inserted in the store even
           if extraction of their contents failed (of course, resource
           types apart from the basic nfo ones would be missing)
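
   The extractor side of that flow could look roughly like this (sketch
   only; real code would escape the inserted strings, e.g. with
   tracker_sparql_escape_string(), and would do the update
   asynchronously):

#include <glib.h>
#include <libtracker-sparql/tracker-sparql.h>

/* miner-fs has already inserted the basic nfo data and passes us the
 * URN of the resource; we insert the extracted metadata directly. */
static void
insert_extracted_data (TrackerSparqlConnection *connection,
                       const gchar             *urn,
                       const gchar             *title)
{
        GError *error = NULL;
        gchar *sparql;

        sparql = g_strdup_printf ("INSERT { <%s> a nmm:MusicPiece ; "
                                  "nie:title \"%s\" }",
                                  urn, title);

        tracker_sparql_connection_update (connection, sparql,
                                          G_PRIORITY_DEFAULT, NULL, &error);

        if (error != NULL) {
                /* the file keeps its basic nfo data in the store even if
                 * this extraction-side insert fails */
                g_warning ("extraction insert failed: %s", error->message);
                g_error_free (error);
        }

        g_free (sparql);
}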

o. Investigate whether an extractor module can be improved (replacing a
   library like poppler :p, because it's so damn slow, etc) for
   performance. Idem for other formats.

o. Delete artists alongside deleting a song resource (workaround for not
   having reference counting and orphan-deletion - yet)

o. Do orphan deletion (major item in future roadmap already).

o. Refactor tracker-miner-fs (libtracker-miner), splitting
   tracker-miner-fs.c into different submodules and files, as done with
   the processing pool.

o. Limit the maximum number of requests sent to the store. This is not
   being done with the refactored processing pool.
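
   A generic sketch of such a cap (the names are placeholders, not
   existing Tracker API): anything beyond MAX_IN_FLIGHT waits in a queue
   until an earlier request has finished.

#include <glib.h>

#define MAX_IN_FLIGHT 10

static guint  in_flight = 0;
static GQueue waiting = G_QUEUE_INIT;

static void send_to_store (const gchar *sparql);

static void
request_finished (void)
{
        gchar *next;

        in_flight--;

        next = g_queue_pop_head (&waiting);
        if (next != NULL) {
                send_to_store (next);
                g_free (next);
        }
}

static void
send_to_store (const gchar *sparql)
{
        in_flight++;
        /* the real code would fire the asynchronous update here and
         * call request_finished () from its callback */
        g_print ("sending: %s\n", sparql);
}

static void
queue_request (const gchar *sparql)
{
        if (in_flight >= MAX_IN_FLIGHT)
                g_queue_push_tail (&waiting, g_strdup (sparql));
        else
                send_to_store (sparql);
}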

o. Try to document the execution flow of the miner-fs. 

o. Manage the list of ignored files/directories/patterns directly at
   tracker-monitor level in libtracker-miner:

        o. Events can properly be merged (e.g. a rename of a non-ignored
           filename to an ignored filename should be notified to upper
           layers as a DELETE and not as a MOVE; see the sketch after
           this item).

        o. Improves management of actual GFileMonitors (e.g. a rename
           of a non-ignored directory name to an ignored directory name
           should trigger removal of the GFileMonitor in the directory;
           currently this is not done at TrackerMonitor level as it
           doesn't know about ignored directories).
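
   A sketch of the event-translation rule described above; is_ignored()
   is a placeholder for the real filtering that libtracker-miner would
   do:

#include <glib.h>
#include <gio/gio.h>

typedef enum {
        EVENT_NONE,
        EVENT_CREATED,
        EVENT_DELETED,
        EVENT_MOVED
} EventType;

static gboolean
is_ignored (GFile *file)
{
        gchar *basename = g_file_get_basename (file);
        /* placeholder rule: hide backup files */
        gboolean ignored = g_str_has_suffix (basename, "~");

        g_free (basename);
        return ignored;
}

static EventType
translate_move (GFile *old_file,
                GFile *new_file)
{
        gboolean old_ignored = is_ignored (old_file);
        gboolean new_ignored = is_ignored (new_file);

        if (old_ignored && new_ignored)
                return EVENT_NONE;     /* upper layers don't care at all */
        if (old_ignored && !new_ignored)
                return EVENT_CREATED;  /* the file "appears" for upper layers */
        if (!old_ignored && new_ignored)
                return EVENT_DELETED;  /* "disappears": a DELETE, not a MOVE */

        return EVENT_MOVED;            /* plain rename between visible names */
}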

o. Refactor of tracker-writeback to be more like how tracker-extract
   works:
        o. Move tracker-writeback's "Writeback" listening to miner-fs
        o. Simplify tracker-writeback itself
        o. Remove IgnoreNextUpdate in miner-fs; what IgnoreNextUpdate
           does now can be wrapped around the DBus call to
           tracker-writeback's service.
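
   A rough sketch of that wrapping; the hash table and
   call_writeback_service() are placeholders, not existing Tracker API:

#include <glib.h>
#include <gio/gio.h>

static GHashTable *writeback_in_progress = NULL;

static void
call_writeback_service (GFile *file)
{
        /* the real code would do the DBus call to tracker-writeback here */
}

static void
request_writeback (GFile *file)
{
        gchar *uri = g_file_get_uri (file);

        if (writeback_in_progress == NULL)
                writeback_in_progress = g_hash_table_new_full (g_str_hash,
                                                               g_str_equal,
                                                               g_free, NULL);

        /* the equivalent of IgnoreNextUpdate: remember that the next
         * change to this file is caused by ourselves */
        g_hash_table_insert (writeback_in_progress,
                             g_strdup (uri), GINT_TO_POINTER (TRUE));

        call_writeback_service (file);

        g_free (uri);
}

static gboolean
should_ignore_change_event (GFile *file)
{
        gchar *uri = g_file_get_uri (file);
        gboolean ignore;

        ignore = writeback_in_progress != NULL &&
                 g_hash_table_remove (writeback_in_progress, uri);

        g_free (uri);
        return ignore;
}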

o. Add a signal when Restore() is finished, to let clients know that
   they should restart and/or invalidate their locally cached RDF data
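
   For example, a client could then listen for it with GDBus roughly
   like this; the signal name ("RestoreFinished") is made up here, since
   the signal doesn't exist yet, and I'm assuming it would live on the
   Backup DBus interface:

#include <gio/gio.h>

static void
on_restore_finished (GDBusConnection *connection,
                     const gchar     *sender_name,
                     const gchar     *object_path,
                     const gchar     *interface_name,
                     const gchar     *signal_name,
                     GVariant        *parameters,
                     gpointer         user_data)
{
        /* throw away locally cached RDF data and re-query the store */
        g_print ("Restore finished, invalidating local caches\n");
}

int
main (void)
{
        GError *error = NULL;
        GDBusConnection *connection;
        GMainLoop *loop;

        g_type_init ();  /* only needed with GLib < 2.36 */

        connection = g_bus_get_sync (G_BUS_TYPE_SESSION, NULL, &error);
        if (!connection) {
                g_printerr ("%s\n", error->message);
                g_error_free (error);
                return 1;
        }

        g_dbus_connection_signal_subscribe (connection,
                                            "org.freedesktop.Tracker1",
                                            "org.freedesktop.Tracker1.Backup",
                                            "RestoreFinished",
                                            "/org/freedesktop/Tracker1/Backup",
                                            NULL,
                                            G_DBUS_SIGNAL_FLAGS_NONE,
                                            on_restore_finished,
                                            NULL, NULL);

        loop = g_main_loop_new (NULL, FALSE);
        g_main_loop_run (loop);

        return 0;
}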


Cheers,

Philip

-- 


Philip Van Hoof
freelance software developer
Codeminded BVBA - http://codeminded.be



