Re: [Tracker] Refactoring the filesystem miner



On 14/09/11 12:04, Carlos Garnacho wrote:
Hey hey,

Hi Carlos,

Lately I've been thinking about how to improve the TrackerMinerFS design and
performance, as it's a big piece of code that's getting too intricate at

Me too. I'll come to my thoughts later in this email.

places. It mainly has 2 roles that we should separate further:

   * Keeping track of what files to index (either fed through the crawler
or the dir monitors)
   * actually indexing them

For each of these 2 roles TrackerMinerFS maintains one cache (mtimes for
the first, URNs for the second) that's filled in per-directory as
processing goes, which introduces a latency directly related to how
scattered the data is in the FS.

Another source of latency is the need to have a parent folder URN before
inserting the data for the file at hand, which forces a flush/commit
right before indexing files within a folder to keep
nfo:belongsToContainer consistent, but that's harder to beat.

So, my idea to improve these situations is to separate the first role
out to a separate object that is able to carry out caching operations at
a higher level than folders (probably for entire configured
directories), and would hide the crawler and the monitor from the miner.
That way the miner would query in one go what it now does in scattered
chunks. Very rough testing seemed to show crawling is reduced to 30%-40%
of the original time, just ~2x the effort of only adding the directory
monitors.

That's quite impressive.

Additionally, I think a filesystem abstraction object should be in
place, where GFiles are canonicalized so every comparison afterwards can
be performed through == and !=, and directories (and related data,
mtime, URN...) are cached for a longer term, while regular files are

This would indeed be nice. The comparison right now does feel clunky, and we've had bugs in the past where 2 GFile objects were equal according to g_file_equal() but had different pointers. It would be nice to simplify things a bit.
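
To make the canonicalization idea concrete, here is a minimal sketch of interning, assuming files are identified by their URI string. It is plain C rather than GLib/GIO, and file_intern(), the Node type and the bucket count are hypothetical names made up for illustration, not anything in TrackerMinerFS. Every equal URI maps to one shared pointer, so later comparisons can use == and !=:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define N_BUCKETS 1024

/* One entry in the intern table: an owned copy of the URI. */
typedef struct Node {
    char *uri;
    struct Node *next;
} Node;

static Node *buckets[N_BUCKETS];

/* Simple multiplicative string hash, good enough for a sketch. */
static unsigned long hash_str(const char *s) {
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Return the single canonical pointer for this URI, creating it on
 * first use. Returns NULL only if allocation fails. */
const char *file_intern(const char *uri) {
    assert(uri != NULL);
    unsigned long b = hash_str(uri) % N_BUCKETS;

    /* Already interned? Hand back the existing canonical pointer. */
    for (Node *n = buckets[b]; n != NULL; n = n->next)
        if (strcmp(n->uri, uri) == 0)
            return n->uri;

    Node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->uri = malloc(strlen(uri) + 1);
    if (n->uri == NULL) {
        free(n);
        return NULL;
    }
    strcpy(n->uri, uri);
    n->next = buckets[b];
    buckets[b] = n;
    return n->uri;
}
```

With something like this, two lookups of the same location return the same pointer, at the cost of keeping every interned entry alive. A real version would need the lifetime rules Carlos describes (directories cached longer, regular files short-lived), which this sketch leaves out.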

more short-lived. I'd expect a slightly higher memory usage with this,
but almost negligible, since we already have GFiles in memory for every
monitored directory and every file waiting to be processed/indexed.

But this would especially help in non-first indexes, as actual indexing
(mostly bound to tracker-extract) outweighs these file operations.

Indeed.

Opinions?

It all sounds very good. Any ideas on timelines for this?

My thoughts:

- It might make sense to split the current functionality into more modules first, to make things easier to refactor in turn. I've been meaning to do this for one or two files which are > 5k LoC.

- How does this affect the miner-config branch which has yet to land in master?

- There are some other features I would like to see added which have been recently mentioned in a bug from Bastien¹, namely:

  1. Disable indexing removable media by default (how useful is this?
     I currently only need it for my music/photos, but I can specify
     those directly anyway, and it picks up a load of other crap like
     backups if I just do it blindly, so ...)

  2. I wonder if we should be more clever about what we monitor, some
     ideas I had:

     A. Only monitor locations where files have changed in the last
        month to avoid wasting monitors and spending so much time
        setting them up?

     B. Don't set up monitors for removable media, just crawl them (as
        we do now anyway) when they're mounted? If data changes
        frequently on them, users can add specific locations through
        the config.

     C. Don't add monitors to directories which are obviously code
        repositories.

  3. I think we should have some option to force indexing source code
     directories (this touches on 2C a bit). Bastien mentioned that
     developers are having issues using their desktop with projects
     checked out in $HOME somewhere and I will admit, I avoid indexing
     my source dirs. Perhaps we should do more here.

  4. Detect when the user has been away for n minutes (like Gossip and
     other IM clients have done for years) and use that to index new
     content in the background. This might have to be optional given
     some people will expect content to be up to date.
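
To illustrate ideas 2A and 2C above, here is a rough sketch of what a "should we monitor this directory?" check could look like. It is plain POSIX C, not Tracker code: should_monitor(), changed_recently(), looks_like_repository() and the list of marker directories are all assumptions made up for this example, and the 30 day cut-off would be a parameter, not a fixed policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

#define SECONDS_PER_DAY (24 * 60 * 60)

/* 2A: true if mtime is no older than max_age_days before now. */
bool changed_recently(time_t mtime, time_t now, int max_age_days) {
    return difftime(now, mtime) <= (double)max_age_days * SECONDS_PER_DAY;
}

/* 2C: true if dir contains a well-known version control marker.
 * The marker list is an assumption, not an exhaustive one. */
bool looks_like_repository(const char *dir) {
    static const char *markers[] = { ".git", ".svn", ".hg", ".bzr" };
    char path[4096];
    struct stat st;

    assert(dir != NULL);
    for (size_t i = 0; i < sizeof markers / sizeof markers[0]; i++) {
        int n = snprintf(path, sizeof path, "%s/%s", dir, markers[i]);
        if (n < 0 || (size_t)n >= sizeof path)
            continue; /* path too long, skip this marker */
        if (stat(path, &st) == 0)
            return true;
    }
    return false;
}

/* Combine both heuristics: skip repositories and stale directories. */
bool should_monitor(const char *dir, time_t now, int max_age_days) {
    struct stat st;

    assert(dir != NULL);
    if (stat(dir, &st) != 0 || !S_ISDIR(st.st_mode))
        return false;
    if (looks_like_repository(dir))
        return false;
    return changed_recently(st.st_mtime, now, max_age_days);
}
```

One caveat with 2A as sketched: a directory's mtime only changes when entries are added, removed or renamed in it, not when an existing file's contents are modified, so a stricter version would have to look at the files themselves. The check also only looks at the directory it is given, so subdirectories of a checkout would need the result of their parent propagated down.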


¹ https://bugzilla.gnome.org/show_bug.cgi?id=659025

Thoughts?

--
Regards,
Martyn

Founder and CEO of Lanedo GmbH.


