Re: [Tracker] Refactoring the filesystem miner

Hey,

On Wed, 2011-09-14 at 15:15 +0100, Martyn Russell wrote:
> On 14/09/11 12:04, Carlos Garnacho wrote:
>> Hey hey,

> Hi Carlos,

>> Lately I've been thinking about how to improve the TrackerMinerFS
>> design and performance, as it's a big piece of code that's getting too
>> intricate in places.

> Me too, I will come on to my thoughts later in this email.

>> It mainly has 2 roles that we should separate further:

>>   * Keeping track of what files to index (either fed through the
>>     crawler or the dir monitors)
>>   * Actually indexing them

>> For each of these 2 roles TrackerMinerFS maintains one cache (mtimes
>> for the first, URNs for the second) that's filled in per-directory as
>> processing goes, which introduces a latency directly related to how
>> scattered the data is in the FS.

>> Another source of latency is the need to have a parent folder URN
>> before inserting the data for the file at hand, which forces a
>> flush/commit right before indexing files within a folder to keep
>> nfo:belongsToContainer consistent, but that's harder to beat.
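
To illustrate what I mean there: the insert for each file points at its
parent folder's URN, so the folder data must have hit the database before
any of its children do (the URN and path below are made up):

  INSERT {
    _:file a nfo:FileDataObject ;
           nie:url 'file:///home/user/docs/report.odt' ;
           nfo:belongsToContainer <urn:uuid:1b2c3d4e>   # parent folder URN,
                                                        # must already exist
  }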

>> So, my idea to improve this situation is to separate the first role
>> out into a separate object that is able to carry out caching
>> operations at a higher level than folders (probably for entire
>> configured directories), and would hide the crawler and the monitors
>> from the miner. That way the miner would query in one go what it now
>> does in scattered chunks. Very rough testing seemed to show crawling
>> time is reduced to 30%-40% of the original, just ~2x the effort of
>> only adding the directory monitors.

> That's quite impressive.
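
Just to make the "query in one go" part more concrete, here's an
untested sketch of what the per-root preload could look like; the query
shape, the fn:starts-with filter and the function name are assumptions,
not the final design:

  #include <gio/gio.h>
  #include <libtracker-sparql/tracker-sparql.h>

  /* Fetches URNs for everything under @root in a single query, instead
   * of one query per directory as we do now (mtime handling omitted
   * here, it'd be cached similarly). @cache maps URI strings to URN
   * strings and must use g_str_hash()/g_str_equal(). */
  static void
  preload_root (TrackerSparqlConnection *conn,
                GFile                   *root,
                GHashTable              *cache)
  {
    gchar *uri, *query;
    TrackerSparqlCursor *cursor;
    GError *error = NULL;

    uri = g_file_get_uri (root);
    query = g_strdup_printf ("SELECT ?urn ?url { "
                             "  ?urn a nfo:FileDataObject ; "
                             "       nie:url ?url . "
                             "  FILTER (fn:starts-with (?url, \"%s\")) "
                             "}", uri);

    cursor = tracker_sparql_connection_query (conn, query, NULL, &error);

    if (cursor)
      {
        while (tracker_sparql_cursor_next (cursor, NULL, NULL))
          {
            g_hash_table_insert (cache,
                                 g_strdup (tracker_sparql_cursor_get_string (cursor, 1, NULL)),
                                 g_strdup (tracker_sparql_cursor_get_string (cursor, 0, NULL)));
          }

        g_object_unref (cursor);
      }
    else
      {
        g_warning ("Preload query failed: %s", error->message);
        g_error_free (error);
      }

    g_free (query);
    g_free (uri);
  }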

>> Additionally, I think a filesystem abstraction object should be in
>> place, where GFiles are canonicalized so every comparison afterwards
>> can be performed through == and !=, and directories (and related data,
>> mtime, URN...) are cached for a longer term, while regular files are
>> more short-lived.

> This would indeed be nice. The comparison right now does feel clunky,
> and we've had bugs in the past where two GFile objects compared equal
> with g_file_equal() but the pointers were different. It would be nice
> to simplify things a bit.
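
For the canonicalization bit I imagine something as small as this (names
are hypothetical, just a sketch): one hash table keyed with
g_file_hash()/g_file_equal() that hands out a single canonical GFile per
URI, so everything downstream can compare plain pointers:

  #include <gio/gio.h>

  static GHashTable *file_cache = NULL;

  /* Returns the canonical GFile for @file (borrowed reference); after
   * this, two lookups for the same URI yield the same pointer. */
  static GFile *
  tracker_file_system_peek (GFile *file)
  {
    GFile *canonical;

    if (G_UNLIKELY (file_cache == NULL))
      file_cache = g_hash_table_new_full (g_file_hash,
                                          (GEqualFunc) g_file_equal,
                                          (GDestroyNotify) g_object_unref,
                                          NULL);

    canonical = g_hash_table_lookup (file_cache, file);

    if (canonical == NULL)
      {
        canonical = g_object_ref (file);
        g_hash_table_insert (file_cache, canonical, canonical);
      }

    return canonical;
  }

Directories would additionally hang their mtime/URN data off the
canonical GFile (e.g. with g_object_set_qdata()), while regular files
would get evicted once processed.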

>> I'd expect a slightly higher memory usage with this, but almost
>> negligible, since we already have GFiles in memory for every monitored
>> directory and every file waiting to be processed/indexed.

>> But this would especially help on non-first indexing runs, as actual
>> indexing (mostly bound to tracker-extract) outweighs these file
>> operations.

> Indeed.

>> Opinions?

> It all sounds very good. Any ideas on timelines for this?

Not fully sure; it could probably be done in 2 weeks or a bit more. I
think TrackerCrawler and TrackerMonitor can be used as-is, which saves
quite a lot of work, but there are operations at the miner level that
would be affected and deserve extra care, especially:

  * mounts/unmounts
  * moving files, overwriting files
  * moving directories
  * moving stuff in and out of inspected directory trees

We should write unit tests for these to ensure correct behavior.
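
Something along these lines, using plain GIO primitives just to show the
shape (the real tests would drive a TrackerMinerFS instance; the paths
and test name are made up):

  #include <gio/gio.h>

  static void
  on_monitor_event (GFileMonitor      *monitor,
                    GFile             *file,
                    GFile             *other_file,
                    GFileMonitorEvent  event,
                    gpointer           user_data)
  {
    /* With G_FILE_MONITOR_SEND_MOVED we get one MOVED event instead of
     * a DELETED/CREATED pair, which is what the miner wants to see. */
    if (event == G_FILE_MONITOR_EVENT_MOVED)
      g_main_loop_quit (user_data);
  }

  static void
  test_directory_move (void)
  {
    gchar *base_path = g_dir_make_tmp ("miner-fs-test-XXXXXX", NULL);
    GFile *base = g_file_new_for_path (base_path);
    GFile *old_dir = g_file_get_child (base, "old");
    GFile *new_dir = g_file_get_child (base, "new");
    GMainLoop *loop = g_main_loop_new (NULL, FALSE);
    GFileMonitor *monitor;
    GError *error = NULL;
    gboolean ret;

    ret = g_file_make_directory (old_dir, NULL, &error);
    g_assert_no_error (error);
    g_assert (ret);

    monitor = g_file_monitor_directory (base, G_FILE_MONITOR_SEND_MOVED,
                                        NULL, &error);
    g_assert_no_error (error);
    g_signal_connect (monitor, "changed",
                      G_CALLBACK (on_monitor_event), loop);

    ret = g_file_move (old_dir, new_dir, G_FILE_COPY_NONE,
                       NULL, NULL, NULL, &error);
    g_assert_no_error (error);
    g_assert (ret);

    /* Spins until the MOVED event arrives; a real test would add a
     * timeout, and assert the miner updated nie:url for all children. */
    g_main_loop_run (loop);

    g_object_unref (monitor);
    g_object_unref (old_dir);
    g_object_unref (new_dir);
    g_object_unref (base);
    g_main_loop_unref (loop);
    g_free (base_path);
  }

  int
  main (int argc, char **argv)
  {
    g_test_init (&argc, &argv, NULL);
    g_test_add_func ("/miner-fs/directory-move", test_directory_move);
    return g_test_run ();
  }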


> My thoughts:
>
> - It might make sense to split the current functionality into more
>   modules first, to make things easier to refactor in turn. I've been
>   meaning to do this for one or two files which are > 5k LoC.

Very much agreed :)


> - How does this affect the miner-config branch which has yet to land
>   in master?

It's somewhat orthogonal, but it touches related code and could also
make use of the filesystem abstraction, so it could probably be
considered a starting point for the bigger refactor.


> - There are some other features I would like to see added which have
>   been recently mentioned in a bug from Bastien, namely:
>
>    1. Disable indexing removable media by default (how useful is this?
>       I currently only need it for my music/photos, but I can specify
>       those directly anyway, and it picks up a load of other crap like
>       backups if I just do it blindly, so ...)

I think it'd still make sense to be able to whitelist some specific
devices, perhaps even with Nautilus integration so it shows an "index
this media?" info bar :)


>    2. I wonder if we should be more clever about what we monitor; some
>       ideas I had:
>
>       A. Only monitor locations where files have changed in the last
>          month, to avoid wasting monitors and spending so much time
>          setting them up?

Hmm, the downside of that is that you'd only notice changes in an older
directory on the next restart. I'd rather first see how fast we can get
at setting up monitors :)


>       B. Don't set up monitors for removable media, just crawl them (as
>          we do now anyway) when they're mounted? If data changes
>          frequently on them, users can add specific locations through
>          the config.
>
>       C. Don't add monitors to directories which are obviously code
>          repositories.

I quite agree there; Tracker isn't usually going to be the tool of
choice for code search.
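
For 2C, a heuristic as dumb as checking for well-known VCS markers would
probably catch most of them (the helper name is made up):

  #include <gio/gio.h>

  /* Returns TRUE if @dir looks like a code checkout, i.e. it carries a
   * well-known VCS marker, so the miner can skip adding a monitor. */
  static gboolean
  dir_looks_like_checkout (GFile *dir)
  {
    const gchar *markers[] = { ".git", ".svn", ".hg", ".bzr", "CVS" };
    guint i;

    for (i = 0; i < G_N_ELEMENTS (markers); i++)
      {
        GFile *marker = g_file_get_child (dir, markers[i]);
        gboolean exists = g_file_query_exists (marker, NULL);

        g_object_unref (marker);

        if (exists)
          return TRUE;
      }

    return FALSE;
  }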


>    3. I think we should have some option to force indexing source code
>       directories (this touches on 2C a bit). Bastien mentioned that
>       developers are having issues using their desktop with projects
>       checked out in $HOME somewhere, and I will admit, I avoid
>       indexing my source dirs. Perhaps we should do more here.
>
>    4. Detect when the user has been away for n minutes (like Gossip
>       and other IM clients have done for years) and use that to index
>       new content in the background. This might have to be optional
>       given some people will expect content to be up to date.

Initial indexing being a one-time thing, I'm a bit unsure about this.
Perhaps it shouldn't be done at full throttle, though, so it doesn't
feel as taxing and is just a bit slower; but there's certainly no magic
throttling number that's good for everyone.

  Carlos

> https://bugzilla.gnome.org/show_bug.cgi?id=659025
>
> Thoughts?