[Tracker] Refactoring the filesystem miner



Hey hey,

Lately I've been thinking about how to improve the design and
performance of TrackerMinerFS, as it's a big piece of code that's getting
too intricate in places. It mainly has 2 roles that we should separate
further:

  * Keeping track of what files to index (fed either through the crawler
    or the dir monitors)
  * Actually indexing them

For each of these 2 roles TrackerMinerFS maintains one cache (mtimes for
the first, URNs for the second) that's filled in per directory as
processing goes, which introduces a latency directly related to how
scattered the data is in the FS.

Another source of latency is the need to have a parent folder URN before
inserting the data for the file at hand, which forces a flush/commit
right before indexing files within a folder to keep
nfo:belongsToContainer consistent, but that's harder to beat.

So, my idea to improve these situations is to move the first role out
into a separate object that is able to carry out caching operations at
a higher level than folders (probably for entire configured
directories), and that would hide the crawler and the monitor from the
miner. That way the miner would query in one go what it now does in
scattered chunks. Very rough testing seemed to show crawling is reduced
to 30%-40% of the original time, just ~2x the effort of only adding the
directory monitors.
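
To make the idea a bit more concrete, here is a minimal sketch in Python
(the real code is C/GObject, so this only illustrates the shape). The
FileNotifier and Changes names are hypothetical, and the stored_mtimes
dict stands in for the result of a single store query covering a whole
configured directory. The object walks the tree once, diffs it against
that preloaded cache, and hands the miner only the resulting changes:

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class Changes:
    """What the miner would receive; it never sees the crawler itself."""
    created: list = field(default_factory=list)
    updated: list = field(default_factory=list)
    deleted: list = field(default_factory=list)


class FileNotifier:
    """Hypothetical object sitting between crawler/monitors and the miner."""

    def __init__(self, root, stored_mtimes):
        # Assumption: stored_mtimes is {absolute path: mtime}, fetched in
        # ONE query for the whole configured directory instead of one
        # query per folder.
        self.root = root
        self.stored = dict(stored_mtimes)

    def crawl(self):
        """Walk the root once and diff it against the preloaded cache."""
        changes = Changes()
        seen = set()
        for dirpath, dirnames, filenames in os.walk(self.root):
            for name in dirnames + filenames:
                path = os.path.join(dirpath, name)
                seen.add(path)
                mtime = int(os.lstat(path).st_mtime)
                if path not in self.stored:
                    changes.created.append(path)
                elif self.stored[path] != mtime:
                    changes.updated.append(path)
        # Anything in the store that was not found on disk is gone.
        changes.deleted = sorted(p for p in self.stored if p not in seen)
        changes.created.sort()
        changes.updated.sort()
        return changes
```

Directory monitor events could be funnelled through the same object, so
the miner would only ever deal with created/updated/deleted lists.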

Additionally, I think a filesystem abstraction object should be in
place, where GFiles are canonicalized so every comparison afterwards can
be performed through == and !=, and directories (and related data,
mtime, URN...) are cached for a longer term, while regular files are
more short-lived. I'd expect a slightly higher memory usage with this,
but almost negligible, since we already have GFiles in memory for every
monitored directory and every file waiting to be processed/indexed.
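
As a rough illustration of the canonicalization part, again as a Python
sketch rather than the eventual C code: the FileSystem and File classes
below are hypothetical names, and os.path.normpath stands in for whatever
normalization the real object would apply to GFiles. Every path maps to
exactly one node, so identity comparison (Python's "is", the counterpart
of == on pointers in C) is enough, and directories outlive regular files
in the cache:

```python
import os


class File:
    """One canonical node per path; compare instances with `is`."""

    def __init__(self, path, is_dir):
        self.path = path
        self.is_dir = is_dir
        self.mtime = None  # cached data, filled in by the caller
        self.urn = None    # cached data, filled in by the caller


class FileSystem:
    """Hypothetical registry handing out canonical File nodes."""

    def __init__(self):
        self._nodes = {}  # normalized path -> File

    def get(self, path, is_dir=False):
        # Normalize first, so different spellings of the same path
        # resolve to the same node.
        key = os.path.normpath(path)
        node = self._nodes.get(key)
        if node is None:
            node = self._nodes[key] = File(key, is_dir)
        return node

    def forget_files(self):
        """Drop regular files once processed; directories stay cached."""
        self._nodes = {k: n for k, n in self._nodes.items() if n.is_dir}


fs = FileSystem()
music = fs.get("/home/user/Music/", is_dir=True)
music.urn = "urn:example:folder"  # made-up URN, for illustration only
song = fs.get("/home/user/Music/./song.ogg")
```

With this, fs.get("/home/user/Music") returns the very same object as
music, along with its cached URN, while song can be dropped with
forget_files() as soon as it has been indexed.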

This would mostly help in non-first indexing runs, though, since on a
first run the actual indexing (mostly bound to tracker-extract)
outweighs these file operations.

Opinions?

  Carlos



