[Tracker] Tracker daemon/indexer responsibilities

From: Martyn Russell <martyn imendio com>
To: Tracker-List <tracker-list gnome org>
Subject: [Tracker] Tracker daemon/indexer responsibilities
Date: Wed, 25 Jun 2008 16:46:58 +0100

Hi all,

Things are all going well on the indexer-split branch. However, it
occurred to me that the daemon has its work duplicated in the indexer. We
need to resolve where the responsibility lies for the daemon and the
indexer.


The Modules
===========

First, about the modules. We have these modules, and they all share a
common API. This API includes functions to:

- Index content
- Get directories
- Know if a file or directory should be ignored

The modules include:

- applications
- files
- gaim-conversations
- firefox-history

The idea is that each of these modules knows how to index, locate and
ignore particular files and directories pertaining to its specific arena
(e.g. instant messaging, browsing, applications, etc).
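
To make the shape of that common API concrete, here is a minimal sketch
in Python. It is not the actual Tracker module interface: the class and
method names, the directory path and the ignore rule are all assumptions
made up for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import PurePath


class IndexerModule(ABC):
    """Hypothetical common API shared by every module."""

    name: str

    @abstractmethod
    def get_directories(self) -> list[str]:
        """Return the directories this module wants crawled."""

    @abstractmethod
    def should_ignore(self, path: str) -> bool:
        """Return True if the file or directory should be skipped."""

    @abstractmethod
    def index(self, path: str) -> dict[str, str]:
        """Return the metadata extracted from one file."""


class ApplicationsModule(IndexerModule):
    name = "applications"

    def get_directories(self) -> list[str]:
        # Assumed location, for illustration only.
        return ["/usr/share/applications"]

    def should_ignore(self, path: str) -> bool:
        # Toy rule: this module only cares about .desktop files.
        return PurePath(path).suffix != ".desktop"

    def index(self, path: str) -> dict[str, str]:
        # A real module would parse the file; this only records the path.
        return {"module": self.name, "path": path}
```

Each of the other modules (files, gaim-conversations, firefox-history)
would be another subclass with its own locations and ignore rules.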


The Daemon
==========

So, recently in the daemon, I just finished writing the code to crawl
the file system and queue ALL files in $HOME or wherever the config
says we should index files from. The daemon also sets up monitors for
each directory it finds along the way. This is all done using the new
GIO functions and works nicely. The files found are then sent in chunks
to the indexer to process. This includes monitor updates to files.
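
The crawl-then-chunk flow described above can be sketched roughly as
follows. This is Python rather than the GIO code in the daemon, and the
function names, the chunking helper and the injectable walk parameter
are assumptions for illustration. Monitors are only modelled as a list
of directories that would be watched.

```python
from __future__ import annotations

import os
from typing import Callable, Iterator


def crawl(root: str, should_ignore: Callable[[str], bool], walk=os.walk):
    """Return (files to index, directories to monitor) found under root."""
    files: list[str] = []
    monitored: list[str] = []
    for dirpath, dirnames, filenames in walk(root):
        # Pruning dirnames in place stops os.walk from descending
        # into ignored directories.
        dirnames[:] = sorted(
            d for d in dirnames if not should_ignore(os.path.join(dirpath, d))
        )
        monitored.append(dirpath)  # one monitor per directory visited
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not should_ignore(path):
                files.append(path)
    return files, monitored


def chunks(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

A caller would then loop over chunks(files, n) and hand each slice to
the indexer. The chunk size is a tuning choice that the message does
not specify.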


The Indexer
===========

The indexer process works like a state machine with 3 queues for:

- Files
- Directories
- Modules

The files queue has the highest priority. Individual files are stored
here, waiting for metadata extraction, etc. Files are taken one by one
to be processed; when this queue is empty, a single token from the
_next_ queue is processed.

The directories queue is the _next_ queue. Directories are waiting for
inspection here. When a directory is checked, the contained files and
directories will be prepended to their respective queues. When this
queue is empty, a single token from the _next_ queue is processed.

The last queue, and again the _next_ queue after the directories queue,
is the modules queue. When all files from the previous module have been
inspected, the next module then does its part, and this continues until
all modules are finished. At this point the indexer quits. It should be
noted here that the indexer is an impermanent entity. It only survives
to process work given to it.
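
The three-queue priority scheme can be illustrated with a short Python
sketch. This is not the indexer's real code: the class name, the
callbacks, and the choice to represent a module as a plain list of
starting directories are all assumptions for illustration.

```python
from collections import deque


class Indexer:
    """Toy state machine: files first, then directories, then modules."""

    def __init__(self, modules, list_dir, process_file):
        self.files = deque()
        self.directories = deque()
        # Assumption: each module is just a list of starting directories.
        self.modules = deque(modules)
        self.list_dir = list_dir          # directory -> (files, subdirs)
        self.process_file = process_file  # called once per file

    def step(self) -> bool:
        """Process one token; return False once every queue is empty."""
        if self.files:
            self.process_file(self.files.popleft())
        elif self.directories:
            files, subdirs = self.list_dir(self.directories.popleft())
            # Prepend to the respective queues, keeping listing order.
            self.files.extendleft(reversed(files))
            self.directories.extendleft(reversed(subdirs))
        elif self.modules:
            # Only reached when the previous module's work is done.
            self.directories.extend(self.modules.popleft())
        else:
            return False
        return True

    def run(self) -> None:
        # The indexer is impermanent: it returns when nothing is left.
        while self.step():
            pass
```

Because step() always checks the files queue first, a directory is only
inspected once every pending file has been processed, and a module is
only started once both other queues have drained.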


The Problem
===========

The question is, should the daemon do some of this work? The issue here
for the daemon is that what it does is highly specific to "files" only.
It doesn't know anything about instant messaging files, locations, what
should be ignored, what should be monitored, etc.

When running the indexer right now, it sits at about 25% to 33% in the
background indexing files (on my laptop). On my desktop, it can index my
140k files in about 130 seconds using no throttling, and the system is
very usable during this time (and we haven't optimised anything yet
either). The daemon, however, does absolutely nothing after the initial
10-15 seconds (which is how long it takes to set up 6500 monitors and
get all 140k files in my home directory, 30k of which have been ignored
as being unsuitable). So the statistics look good, but the daemon can do
more and should be doing things like monitoring the desktop file
directory so we know when applications are added, removed or updated.

To do this, we have been thinking about how best to design the
indexer/daemon workload so it is most efficient.


The How
=======

So after speaking with Carlos some more about this, the basic idea we
had was to make the indexer JUST index.

To do this means the modules need to be shared. This is so that the
indexer can get each module to index files the way it knows how to index
and so the daemon can request locations to monitor and crawl. The idea
being that the daemon crawls the files and sends all files and
directories (we currently don't send directories, just files) to the
indexer. The indexer needs both files and directories to add these to
the database.
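
As a rough illustration of that split, the Python sketch below has the
daemon ask shared modules for locations and ignore rules, while the
indexer only calls each module's index function. All names here, the
example path and the ignore rule are hypothetical, and the real
processes would communicate over IPC rather than a Python list.

```python
class FilesModule:
    """Hypothetical shared module; both processes would load the same code."""

    name = "files"

    def get_directories(self):
        return ["/home/user"]  # assumed location, for illustration

    def should_ignore(self, path):
        return path.endswith("~")  # toy rule: skip editor backup files

    def index(self, path, is_dir):
        # A real module would extract metadata; this only records the item.
        return {"module": self.name, "path": path, "is_dir": is_dir}


def daemon_collect(modules, list_tree):
    """Daemon side: crawl each module's locations, keep files AND directories."""
    work = []
    for module in modules:
        for root in module.get_directories():
            for path, is_dir in list_tree(root):
                if not module.should_ignore(path):
                    work.append((module.name, path, is_dir))
    return work


def indexer_consume(modules, work):
    """Indexer side: JUST index, using the module named in each work item."""
    by_name = {module.name: module for module in modules}
    return [by_name[name].index(path, is_dir) for name, path, is_dir in work]
```

The list_tree argument stands in for the crawler: it takes a root and
yields (path, is_dir) pairs.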

We can take this one step further. We can even have the daemon check in
the database before sending files to the indexer to make sure we are not
generating extra work unnecessarily. This is something we don't do at
all yet, but is planned.
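
Since this check is not written yet, the following is only a guess at
what it could look like, sketched in Python with an in-memory SQLite
database. The table layout, the function names and the choice of
comparing modification times are all assumptions; nothing above says
what the real check would compare.

```python
import sqlite3


def open_index_db():
    """Create a throwaway database with a hypothetical 'indexed' table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE indexed (path TEXT PRIMARY KEY, mtime INTEGER)")
    return db


def needs_indexing(db, path, mtime):
    """True if the path is unknown or its stored mtime differs."""
    row = db.execute(
        "SELECT mtime FROM indexed WHERE path = ?", (path,)
    ).fetchone()
    return row is None or row[0] != mtime


def filter_work(db, candidates):
    """Keep only the (path, mtime) candidates worth sending to the indexer."""
    return [path for path, mtime in candidates if needs_indexing(db, path, mtime)]
```

With something like this in the daemon, a restart would only send new
or changed files instead of re-queueing everything.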


The Conclusion
==============

This work is mostly done right now. It is merely a case of moving the
architecture around a bit and moving code between processes. But is this
the right approach? What do you think? Comments welcome!


-- 
Regards,
Martyn

Follow-Ups:
- Re: [Tracker] Tracker daemon/indexer responsibilities
  - From: Jamie McCracken
- Re: [Tracker] Tracker daemon/indexer responsibilities
  - From: Jamie McCracken
