Re: [Tracker] Tracker daemon/indexer responsibilities
- From: Jamie McCracken <jamie mccrack googlemail com>
- To: Martyn Russell <martyn imendio com>
- Cc: Tracker-List <tracker-list gnome org>
- Subject: Re: [Tracker] Tracker daemon/indexer responsibilities
- Date: Thu, 26 Jun 2008 09:39:40 -0400
On Wed, 2008-06-25 at 19:22 +0100, Martyn Russell wrote:
Jamie McCracken wrote:
On Wed, 2008-06-25 at 16:46 +0100, Martyn Russell wrote:
So, recently in the daemon, I just finished writing the code to crawl
the file system and queue ALL files in $HOME or wherever the config
says we should index files from. The daemon also sets up monitors for
each directory it finds along the way. This is all done using the new
GIO functions and works nicely. The files found are then sent in chunks
to the indexer to process. This includes monitor updates to files.
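The chunked hand-off described above can be sketched roughly as follows. This is a Python sketch of the batching logic only; `CHUNK_SIZE`, the helper name, and the file paths are illustrative assumptions, not Tracker's actual DBus API:

```python
# Sketch: the daemon sends crawled files to the indexer in fixed-size
# chunks rather than one giant list. CHUNK_SIZE is an assumed value.
CHUNK_SIZE = 100

def chunked(items, size=CHUNK_SIZE):
    """Yield successive lists of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# 250 crawled files become 3 DBus calls: 100 + 100 + 50 items.
files = [f"/home/user/file-{n}" for n in range(250)]
batches = list(chunked(files))
print(len(batches), len(batches[-1]))  # 3 50
```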
For non-inotify, I take it we will not use GIO? Do we have enough info to
manage things this way?
Actually, I updated the tracker-monitor.c file to now detect which GIO
backend we use... FAM, Inotify, etc. The same limits are in place as was
before, except we now have the support of GLib which is better I think.
The daemon should be able to get/set tags and user/app metadata in the
index as well.
Primarily the daemon only uses the database in a read only fashion right
now. Any writing to the database should be minimal I guess.
Yes, it should signal the indexer to go idle before writing metadata, if
the indexer has been spawned.
For multiple sessions from one NFS home directory the existing locks
should allow only the first trackerd to spawn tracker-indexer. Others
should be read only and prevented from spawning tracker-indexer (we have
the code already for this, which forces subsequent trackerd instances to
be read only, but you might need to adapt it for this).
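The first-instance locking described above might look roughly like this. A Python sketch; the lock path and function name are assumptions, and as noted later in the thread, exclusive-create locking needs extra care on NFS:

```python
import errno
import os
import tempfile

def acquire_indexer_lock(lock_path):
    """Try to become the single read-write trackerd instance.

    Returns True if we got the lock (we may spawn tracker-indexer),
    False if another trackerd holds it (stay read only).
    O_CREAT|O_EXCL is atomic on local filesystems; NFS is less reliable.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        os.write(fd, str(os.getpid()).encode())
        os.close(fd)
        return True
    except OSError as e:
        if e.errno == errno.EEXIST:
            return False
        raise

lock = os.path.join(tempfile.gettempdir(), "trackerd-demo.lock")
first = acquire_indexer_lock(lock)   # first instance wins
second = acquire_indexer_lock(lock)  # later instances stay read only
print(first, second)  # True False
os.unlink(lock)
```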
I think we need a libtrackermetadata to share the metadata stuff between
indexer and daemon
Care to elaborate a little on what you expect here? :)
tracker-metadata.c and a few other files appear to be shared by the
indexer and the daemon.
The indexer process works like a state machine with 3 queues:
The files queue has the highest priority; individual files are stored
here, waiting for metadata extraction, etc. Files are taken one by one
to be processed. When this queue is empty, a single token from the
_next_ queue is processed.
The directories queue is the _next_ queue. Directories are waiting for
inspection here. When a directory is checked the contained files and
directories will be prepended in their respective queues. When this
queue is empty, a single token from the _next_ queue is processed.
The last queue, the _next_ queue after the directory queue, is the
modules queue. When all files from the previous module have been
inspected, the next module then does its part, and this continues until
all modules are finished. At this point the indexer quits. It should be
noted here that the indexer is an impermanent entity: it only survives
to process the work given to it.
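The three-queue scheduling described above can be sketched as follows. This is Python pseudocode of the priority logic only; the class, queue names, and module names are illustrative, not the actual indexer code:

```python
from collections import deque

class IndexerQueues:
    """Sketch: a lower queue is consulted only when every
    higher-priority queue is empty (files > directories > modules)."""

    def __init__(self, modules):
        self.files = deque()
        self.directories = deque()
        self.modules = deque(modules)

    def next_token(self):
        """Return ('file'|'directory'|'module', item), or None when all
        queues are drained, at which point the indexer quits."""
        if self.files:
            return ("file", self.files.popleft())
        if self.directories:
            # Checking a directory would prepend its contents to the
            # files/directories queues; omitted here for brevity.
            return ("directory", self.directories.popleft())
        if self.modules:
            return ("module", self.modules.popleft())
        return None

q = IndexerQueues(["applications", "files"])
q.directories.append("/home/user")
q.files.append("/home/user/a.txt")

order = []
while (tok := q.next_token()) is not None:
    order.append(tok[0])
print(order)  # ['file', 'directory', 'module', 'module']
```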
Not quite what I had in mind - the indexer should be dumb and fed stuff
to index by the daemon. The exception is directories, which need to be
recursively scanned (I am not sure we need separate queues for them).
Remember, this was an initial step to get the indexer into its own
process and to make it stand alone, the process is iterative, so we
don't consider it the final plan.
The question is, should the daemon do some of this work? The issue here
for the daemon is that what it does is highly specific to "files" only.
It doesn't know anything about instant messaging files, locations, what
should be ignored, what should be monitored, etc.
When running the indexer right now, it sits at about 25%->33% in the
background indexing files (on my laptop), on my desktop, it can index my
140k files in about 130 seconds using no throttling and the system is
very usable during this time (and we haven't optimised anything yet
either). The daemon, however, does absolutely nothing after the initial
10-15 seconds (which is how long it takes to set up 6500 monitors and
get all 140k files in my home directory, 30k of which have been ignored
as unsuitable). So the statistics look good, but the daemon can do
more and should be doing things like monitoring the desktop file
directory so we know when applications are added, removed or updated.
Absolutely - all watching should be done by the daemon.
The indexer should be told what service is being indexed when it's
passed a URL. The daemon will know this, as it keeps track of which
directories belong to which service.
Good, that's what I had in mind.
That was my plan :)
This work is mostly done right now. It is merely a case of moving the
architecture around a bit and moving code between processes. But is this
the right approach? What do you think? Comments welcome!
I think you are on the right track (I have not checked all the source
changes though - just a quick scan)
The main problem between the indexer and daemon is which should handle
recursive indexing of folders (particularly during the first-time
index) - this can be done either way. I don't know which is better.
Well, my idea was that ALL directories should be handed to the indexer,
the indexer should be completely dumb about files, i.e. it just indexes
what it is told. It probably makes sense to optimise this the same way
you optimised the directory monitoring in TRUNK, i.e. by breadth rather
than depth? What do you think here? This means that files closest to the
top-level directory the config says we should index are indexed first,
and so can be found first if the user wishes to search before the
indexing is complete.
Or did you mean something else here?
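The breadth-first ordering suggested above can be illustrated with a small Python sketch. The directory tree here is a hypothetical in-memory structure for determinism; real code would walk the filesystem with GIO:

```python
from collections import deque

# Hypothetical tree: directory -> (subdirectories, files). Breadth-first
# means files nearest the configured top-level directory come first.
tree = {
    "/home": (["/home/Music"], ["/home/notes.txt"]),
    "/home/Music": (["/home/Music/Albums"], ["/home/Music/a.mp3"]),
    "/home/Music/Albums": ([], ["/home/Music/Albums/b.mp3"]),
}

def breadth_first_files(root):
    """Return files in breadth-first order from `root`."""
    order = []
    todo = deque([root])
    while todo:
        d = todo.popleft()
        subdirs, files = tree[d]
        order.extend(files)   # this level's files are indexed first
        todo.extend(subdirs)  # deeper directories wait their turn
    return order

print(breadth_first_files("/home"))
# ['/home/notes.txt', '/home/Music/a.mp3', '/home/Music/Albums/b.mp3']
```

With depth-first order the deepest files could be indexed before shallow, likely more relevant ones; breadth-first makes early searches more useful while indexing is still running.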
I mean: should the daemon recursively check a directory and pass the
files needing to be indexed to the indexer, or should it just pass the
directory to the indexer, which will recursively check it?
For performance reasons the daemon should not be niced at all, so it's
important it does not consume too much CPU or disk I/O, whilst the
indexer does all the I/O- and CPU-intensive stuff and will be niced +19
and ioniced as much as possible.
It isn't. As per our last email about using nice and ioprio, we only do
this stuff in the indexer.
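The priority split above can be sketched as follows. A Python sketch: `os.nice` maps to the same setpriority mechanism C code would use; ioprio has no portable stdlib equivalent, so it is only noted in a comment:

```python
import os

def drop_priority():
    """Sketch: the indexer lowers its own CPU priority at startup so the
    un-niced daemon stays responsive. Returns the new nice value.
    I/O priority would be set separately (ionice/ioprio_set on Linux)."""
    current = os.nice(0)          # nice(0) just reads the current value
    return os.nice(19 - current)  # raise nice to 19, the lowest priority

print(drop_priority())  # 19
```

Raising one's own nice value needs no privileges (unlike lowering it), so the indexer can do this unconditionally at startup.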
Handling file moves also needs some discussion about whether the
indexer or the daemon should do this.
Currently we get the monitor events from GIO and push them to the
indexer from the daemon using DBus, e.g. remove_this_file (foo);
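One way a move event could be forwarded is sketched below. `remove_this_file` is mentioned above; the move handling and `index_this_file` are assumptions, and the recorded call list stands in for the actual DBus dispatch:

```python
# Sketch: translating a monitor "moved" event into the per-file calls
# the daemon pushes to the indexer over DBus.
calls = []  # stands in for DBus messages sent to the indexer

def remove_this_file(path):
    calls.append(("remove", path))

def index_this_file(path):  # assumed counterpart to remove_this_file
    calls.append(("index", path))

def on_moved(src, dst):
    # Without a dedicated "move" method, a move becomes remove + re-index.
    remove_this_file(src)
    index_this_file(dst)

on_moved("/home/user/a.txt", "/home/user/b.txt")
print(calls)
# [('remove', '/home/user/a.txt'), ('index', '/home/user/b.txt')]
```

A dedicated move method would avoid re-extracting metadata for a file whose contents have not changed; that is exactly the design question raised here.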
I'm open to whichever way you go here - there may also be issues on NFS
with slow performance or broken file locking, so we need to be careful.
Architecturally, you just need a libtrackermetadata for the common
metadata routines shared between indexer and daemon (unless these live
somewhere else?).
Hmm, I think right now there is probably duplication in the daemon and
the indexer. So you are more than likely right.
Anyway, thanks for all your work on this - it's looking very promising :)