Re: [Tracker] Making some of our team-planning public



This is a new idea/item that came up today:

o. Some item_remove performance improvements:

   Right now, for each delete, two queries are executed:
     o. Set tracker:available to false       - (a)
     o. Delete the actual resource           - (b)
                                                                 
      Both are put in the process pool with the same priority (see
      tracker_processing_pool_push_ready_task in item_remove). This
      means that the queue looks like this:

      ababababababababababababab

      Query (a) is only done because (b) is slow, but with the queue
      interleaved like this it doesn't help much. The queue should look
      like this instead:

      aaaaaaaaaaaaabbbbbbbbbbbbb
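
      One way to get that grouping (a minimal sketch with GLib, not the
      actual processing-pool code; the Task struct and push_task() are
      made up for this example) is to insert tasks into the queue
      sorted by task type, so every (a) stays ahead of every (b):

      #include <glib.h>

      /* Sketch only: 'a' = set tracker:available to false,
       * 'b' = delete the actual resource.  Sorting by task type keeps
       * the queue in the aaa...bbb shape described above. */
      typedef struct {
              gchar  task_type;   /* 'a' or 'b' */
              gchar *sparql;      /* the update to execute */
      } Task;

      static gint
      task_compare (gconstpointer p1, gconstpointer p2, gpointer user_data)
      {
              const Task *t1 = p1;
              const Task *t2 = p2;

              return t1->task_type - t2->task_type;   /* 'a' sorts first */
      }

      static void
      push_task (GQueue *queue, gchar task_type, gchar *sparql)
      {
              Task *task = g_new0 (Task, 1);

              task->task_type = task_type;
              task->sparql = sparql;
              g_queue_insert_sorted (queue, task, task_compare, NULL);
      }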

o. The process pool must queue actions, not textual strings of queries.
   Why? Because queries can often be merged together. Why is that useful?

   o. For query (a): Instead of doing DELETE { ?c tracker:available
      true } WHERE { ?c nie:url ?u . FILTER (tracker:uri-is-descendant
      ("%s", ?u)) } per file, we could let tracker:uri-is-descendant
      work similarly to SQLite's/SPARQL's IN and let it accept an
      array. This would avoid a full table scan for each and every file
      being deleted and replace it with a single full table scan for a
      single query.

   o. We can do the exact same thing, but with SQLite's/SPARQL's IN,
      for query (b). A rough sketch of both merged queries follows at
      the end of this item.
    
        So that rewrites the queue to:

       a{for 13 items}b{for 13 items}         

       Or sometimes something like (depending on the timings of
       queue-item-execs):
       
       a{for 3 items}a{for 5 items}a{for 5 items}b{for 5 items}b{for 8 items}

       Which is still expected to be better than either 

       aaaaaaaaaaaaabbbbbbbbbbbbb    or
       ababababababababababababab
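
   A rough sketch of what the two merged updates could look like
   (assuming tracker:uri-is-descendant learns to accept a list of URIs
   and that IN is usable in the FILTER; the URIs and the exact syntax
   below are made up for illustration):

   /* Illustrative only: neither the list form of
    * tracker:uri-is-descendant nor these URIs are real. */
   static const gchar *merged_query_a =
           "DELETE { ?f tracker:available true } "
           "WHERE { ?f nie:url ?u . "
           "        FILTER (tracker:uri-is-descendant ("
           "                (\"file:///dir1\", \"file:///dir2\"), ?u)) }";

   static const gchar *merged_query_b =
           "DELETE { ?f a rdfs:Resource } "
           "WHERE { ?f nie:url ?u . "
           "        FILTER (?u IN (\"file:///dir1/1.mp3\", "
           "                       \"file:///dir1/2.mp3\", "
           "                       \"file:///dir2/3.mp3\")) }";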


On Tue, 2011-02-22 at 18:50 +0100, Philip Van Hoof wrote:
Hi there,

I compiled a public list of things that we are planning to do. These
items have different authors (I distilled and cleaned it up from an
etherpad).

This list can help other teams, companies and interested developers in
understanding what our priorities and short-term future plans are.

It can also be helpful for contributors to know what to focus on in
case they want to help us get something done at a higher priority. We
define our own prioritization, so there is no need to try to convince
us to reprioritize unless you also start sending us patches.

ps.   Sending those patches is, though, an insanely, unimaginably good
      technique to get us to listen to you. You'll notice our attention
      shift to you almost instantly. Only because we can't shift our
      attention faster than the speed of light won't we do it even
      before you start writing the patch. In a different universe
      perhaps this would be possible. We lack experience :(

      So start now. And in this universe.

You can get in touch with the team publicly in the channel #tracker on
GIMPNet's IRC servers. Key people are martyn, aleksander, juergbi,
pvanhoof, frade, JL, ottela, abustany, garnacho, rdale, marja (the last
two are qsparql developers and abustany is a qcontacts-tracker
developer -- they aren't Tracker-project maintainers but they are often
involved in decisions).

Check this mailing list for E-mail addresses of these people.

The list changes rapidly. Just pointing that out as a reminder (the
list isn't a promise whatsoever, plus our priorities are often "agile").

Bla bla here is the list:

o. Documentation on live.gnome.org: examples are often still using
   libtracker-client. Most of the documentation on live.gnome.org talks
   about either D-Bus or libtracker-client, and not libtracker-sparql.
   We should update all these documentation bits and pieces.

o. Discussion about what the behaviour of triples outside of the UNION
   block should be
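
   As an illustration of the ambiguity (the query below is made up for
   this example): should the triple pattern outside the UNION block
   constrain both branches, and what happens to solutions where it does
   not match?

   /* Made-up example: ?f nie:url ?u sits outside the UNION block. */
   static const gchar *example_query =
           "SELECT ?f ?u ?title "
           "WHERE { ?f nie:url ?u . "
           "        { ?f a nmm:MusicPiece ; nie:title ?title } "
           "        UNION "
           "        { ?f a nmm:Video ; nie:title ?title } }";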

o. Investigate whether we can use GVariant more often/more efficiently
   for the class signals and writeback signal emit features
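
   For example (a sketch only; this payload layout is an assumption,
   not what tracker-store emits today), the subject/predicate/object
   triples of a class signal could be packed into a single a(sss)
   GVariant:

   #include <glib.h>

   /* Sketch: pack (subject, predicate, object) triples into one
    * GVariant of type a(sss). */
   static GVariant *
   build_triples_variant (const gchar *triples[][3], guint n_triples)
   {
           GVariantBuilder builder;
           guint i;

           g_variant_builder_init (&builder, G_VARIANT_TYPE ("a(sss)"));

           for (i = 0; i < n_triples; i++) {
                   g_variant_builder_add (&builder, "(sss)",
                                          triples[i][0],
                                          triples[i][1],
                                          triples[i][2]);
           }

           return g_variant_builder_end (&builder);
   }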

o. During review I saw a few things in tracker-store that could be done
   in a better way in Vala but were ported 1:1 from C. We should
   perhaps try to simplify that code

o. While investigating I noticed that with a cp of 40 files to a folder
   (no subfolders), only 9 insert queries get batched together using
   UpdateArray, instead of all 40. This leaves us with more of the
   overhead that UpdateArray is meant to reduce. Investigate why the
   UpdateArray query is already fired at 9 files and why it doesn't
   batch all 40.

        It seems item_queue_get_next_file() returns QUEUE_NONE/QUEUE_WAIT
        and in this case we always flush the UpdateArray buffers. Now,
        why is QUEUE_NONE/QUEUE_WAIT returned several times? It can
        only happen when the miner-fs cannot currently start processing
        new items, either because there are none (maybe there are, but
        they need to wait for others to get pushed to the store), or
        because it reached the max number of wait items (the max number
        of items that can be sent in parallel to tracker-extract,
        currently 10). Worth investigating this specific case.

        The cp of 40 files generates 40 CREATED events that are treated
        separately, and item_queue_handlers_setup() is called after
        each one is received; this makes it easier to get into
        QUEUE_NONE/QUEUE_WAIT situations. UpdateArray was initially
        meant to be used during crawling, not when processing events.
        Anyway, it could be tweaked to also merge several updates while
        processing events, but only with a very low time threshold for
        merging events, not the current 15s. Not sure how useful that
        is, considering how new files are copied to the device (one by
        one, as a slow copy over USB).
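
        A rough sketch of what such a low merge threshold could look
        like (the 100 ms value, the buffer and flush_buffer() are
        assumptions for this example, not the current miner-fs code):

        #include <glib.h>

        #define FLUSH_LIMIT      40     /* assumed batch size */
        #define FLUSH_TIMEOUT_MS 100    /* assumed low threshold */

        static GPtrArray *buffer = NULL;
        static guint flush_id = 0;

        static void
        flush_buffer (void)
        {
                if (buffer == NULL || buffer->len == 0)
                        return;
                /* ... send the buffered updates as one UpdateArray ... */
                g_ptr_array_set_size (buffer, 0);
        }

        static gboolean
        flush_timeout_cb (gpointer user_data)
        {
                flush_id = 0;
                flush_buffer ();
                return FALSE;   /* one-shot timeout */
        }

        /* Buffer an update coming from a CREATED event; flush when the
         * buffer is full or shortly after the last event arrives. */
        static void
        queue_update (gchar *sparql)
        {
                if (buffer == NULL)
                        buffer = g_ptr_array_new_with_free_func (g_free);

                g_ptr_array_add (buffer, sparql);

                if (flush_id != 0) {
                        g_source_remove (flush_id);
                        flush_id = 0;
                }

                if (buffer->len >= FLUSH_LIMIT)
                        flush_buffer ();
                else
                        flush_id = g_timeout_add (FLUSH_TIMEOUT_MS,
                                                  flush_timeout_cb, NULL);
        }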

o. The UpdateArray technique is not applied to the communication between
   miner-fs and tracker-extract. The exact same reason why we did
   UpdateArray also applies to the IPC communication between miner-fs
   and tracker-extract. We can implement this technique for that IPC
   too, measure the difference, etc.

        UpdateArray currently makes the store receive several requests
        together in the same D-Bus request (so it reduces D-Bus
        overhead), but it then inserts the updates one by one to get
        per-update errors, if any. We could avoid this and try to
        insert all the updates in the same run, which would give us
        some more performance improvement. If that merged insert fails,
        we could then retry one by one to keep reporting per-request
        errors; a rough sketch follows below. See

        https://projects.maemo.org/mailman/pipermail/tracker-maintainers/2010-September/000128.html
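
        A rough sketch of that retry idea (execute_update() and
        report_error() are hypothetical helpers standing in for
        whatever the store uses to run one update and report one
        error):

        #include <glib.h>

        /* Hypothetical helpers, not real tracker-store functions. */
        static void execute_update (const gchar *sparql, GError **error);
        static void report_error   (guint index, const GError *error);

        /* Sketch only: 'updates' is a NULL-terminated array holding
         * n_updates SPARQL strings from one UpdateArray request. */
        static void
        run_update_array (gchar **updates, guint n_updates)
        {
                GError *error = NULL;
                gchar *merged;
                guint i;

                /* Fast path: try everything as a single update. */
                merged = g_strjoinv (" ", updates);
                execute_update (merged, &error);
                g_free (merged);

                if (error == NULL)
                        return;

                g_clear_error (&error);

                /* Slow path: retry one by one so we can still report
                 * per-request errors. */
                for (i = 0; i < n_updates; i++) {
                        execute_update (updates[i], &error);
                        if (error != NULL) {
                                report_error (i, error);
                                g_clear_error (&error);
                        }
                }
        }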

o. We could allow multiple tracker-extract processes to run
   simultaneously.

      o. An extractor process per folder, with a max amount of
         processes of course (say 4)

      o. An extractor process per file-type or group of file-types

      o. It would be good to know anyway which is the best possible
         time we could ever achieve with the current setup, e.g. by
         computing the sum of all independent extraction times of all
         files and checking what percentage of the total indexing time
         that is. If that percentage is very high, it is worth
         investigating several tracker-extracts in parallel; otherwise,
         check where the bottleneck is.
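
         For instance (numbers made up): if the per-file extraction
         times sum to 45 s out of a 60 s indexing run, extraction is
         75% of the total and parallel tracker-extracts look worth
         trying.  A trivial helper for that share:

         #include <glib.h>

         /* Sketch: percentage of total indexing time spent extracting,
          * given the sum of all independent per-file extraction times. */
         static gdouble
         extraction_share (gdouble sum_extraction_time,
                           gdouble total_indexing_time)
         {
                 if (total_indexing_time <= 0)
                         return 0.0;

                 return 100.0 * sum_extraction_time / total_indexing_time;
         }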

o. Investigate a change in the flow so that tracker-miner-fs doesn't
   merge the SPARQL received from tracker-extract. This reduces not
   only some D-Bus overhead, but also the need to play with huge chunks
   of heap memory in the miner-fs to construct the SPARQL update query:

      o. tracker-miner-fs extracts basic file metadata and sends it to
         the store
      o. When the reply from the store is received for a given file,
         request the extractor to extract data from the file, passing
         the URN of the resource
      o. Let tracker-extract insert the extracted data directly into
         the store.
      o. This will also allow having files inserted in the store even
         if extraction of their contents failed (of course resource
         types would be missing, apart from the basic nfo ones)
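
   The step ordering, with entirely hypothetical function names (this
   is only a sketch of the proposed flow, not existing code):

   #include <gio/gio.h>

   /* Hypothetical helpers standing in for the real miner-fs/IPC calls. */
   static const gchar *insert_basic_metadata (GFile *file);
   static void         request_extraction    (GFile *file, const gchar *urn);

   static void
   process_file (GFile *file)
   {
           const gchar *urn;

           /* 1. miner-fs inserts only the basic file metadata itself
            *    and gets the resource URN back from the store. */
           urn = insert_basic_metadata (file);

           /* 2. It then asks tracker-extract to process the file,
            *    passing the URN; tracker-extract inserts the extracted
            *    data into the store directly, so miner-fs never has to
            *    merge or buffer that SPARQL. */
           request_extraction (file, urn);
   }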

o. Investigate whether an extractor module can be improved for
   performance (e.g. by replacing a library like poppler :p, because
   it's so damn slow). Idem for other formats.

o. Delete artists alongside deleting a song resource (a workaround for
   not having reference counting and orphan deletion - yet).
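
   A rough sketch of what such a workaround query could look like (the
   song URN is made up, and the exact SPARQL would need checking
   against the current query engine):

   /* Sketch only: delete the artist of the song being removed, but
    * only if no other music piece still references that artist.
    * <urn:song:123> is a made-up URN. */
   static const gchar *delete_artist_query =
           "DELETE { ?artist a rdfs:Resource } "
           "WHERE { <urn:song:123> nmm:performer ?artist . "
           "        OPTIONAL { ?other nmm:performer ?artist . "
           "                   FILTER (?other != <urn:song:123>) } "
           "        FILTER (!bound(?other)) }";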

o. Do orphan deletion (major item in future roadmap already).

o. Refactor tracker-miner-fs and libtracker-miner, splitting tracker-
   miner-fs.c into different submodules and files, as was done with the
   processing pool.

o. Limit the maximum number of requests sent to the store. This is not
   being done with the refactored processing pool.

o. Try to document the execution flow of the miner-fs. 

o. Manage the list of ignored files/directories/patterns directly at
   tracker-monitor level in libtracker-miner:

      o. Events can properly be merged (e.g. a rename of a non-ignored
         filename to an ignored filename should be notified to upper
         layers as a DELETE and not as a MOVE); a rough sketch of this
         follows after these items.

      o. Improves management of the actual GFileMonitors (e.g. a rename
         of a non-ignored directory name to an ignored directory name
         should trigger removal of the GFileMonitor in the directory;
         currently this is not done at TrackerMonitor level as it
         doesn't know about ignored directories).
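
      A rough sketch of the event-merging idea (is_ignored() is a
      hypothetical helper for whatever ignore-list check ends up living
      at TrackerMonitor level):

      #include <gio/gio.h>

      /* Hypothetical check against the ignored files/directories/
       * patterns that TrackerMonitor would manage. */
      static gboolean is_ignored (GFile *file);

      /* Sketch only: a MOVED event whose destination is ignored should
       * reach upper layers as a DELETE of the source, not as a MOVE. */
      static GFileMonitorEvent
      merge_event (GFileMonitorEvent event,
                   GFile            *file,
                   GFile            *other_file)
      {
              if (event == G_FILE_MONITOR_EVENT_MOVED &&
                  other_file != NULL &&
                  is_ignored (other_file)) {
                      return G_FILE_MONITOR_EVENT_DELETED;
              }

              return event;
      }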

o. Refactor tracker-writeback to work more like tracker-extract does:
      o. Move tracker-writeback's "Writeback" listening to miner-fs
      o. Simplify tracker-writeback itself
      o. Remove IgnoreNextUpdate in miner-fs; what IgnoreNextUpdate
         does now can be wrapped around the D-Bus call to tracker-
         writeback's service.

o. Add a signal when Restore() is finished, to let clients know that
   they should restart and/or invalidate their locally cached RDF data


Cheers,

Philip


-- 


Philip Van Hoof
freelance software developer
Codeminded BVBA - http://codeminded.be



