Re: [Tracker] Tracker daemon/indexer responsibilities



On Wed, 2008-06-25 at 16:46 +0100, Martyn Russell wrote:
Hi all,

Things are going well on the indexer-split branch; however, it
occurred to me that the daemon has its work duplicated in the indexer.
We need to resolve where the responsibility lies for the daemon and the
indexer.


The Modules
===========

First, about the modules. We have a set of modules which all share a
common API. This API includes functions to:

- Index content
- Get directories
- Know if a file or directory should be ignored

The modules include:

- applications
- files
- gaim-conversations
- firefox-history

The idea is that each of these modules knows how to index, locate and
ignore particular files and directories pertaining to its specific
arena (e.g. instant messaging, browsing, applications, etc).
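
For illustration, a per-module interface could look roughly like the
sketch below - the names and types here are made up, not the actual
module API:

/* Illustrative sketch only - the function names and types here are
 * made up, not the actual tracker module API. Each module
 * (applications, files, gaim-conversations, firefox-history) would
 * fill in one of these so that both the daemon and the indexer can
 * use it. */

#include <glib.h>
#include <gio/gio.h>

typedef struct {
        /* Module name, e.g. "gaim-conversations" */
        const gchar *name;

        /* Directories this module wants crawled and monitored;
         * used by the daemon. */
        GList *      (* get_directories) (void);

        /* Whether a given file or directory should be skipped;
         * used by both the daemon and the indexer. */
        gboolean     (* ignore_file)     (GFile *file);

        /* Extract metadata for one file; used by the indexer.
         * Returns attribute/value pairs. */
        GHashTable * (* index_file)      (GFile *file);
} TrackerModule;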

That's fine.



The Daemon
==========

So, recently in the daemon, I just finished writing the code to crawl
the file system and queue ALL files in $HOME or wherever the config
says we should index files from. The daemon also sets up monitors for
each directory it finds along the way. This is all done using the new
GIO functions and works nicely. The files found are then sent in chunks
to the indexer to process. This includes updates reported by the
monitors.
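
Just as a sketch, the crawl-and-monitor pattern with GIO looks roughly
like this (queue_file_for_indexer() here is a made-up stand-in for
whatever batches paths up and sends them to the indexer in chunks):

#include <gio/gio.h>

/* Hypothetical hook: in the real daemon this would batch paths and
 * send them to the indexer in chunks; here it just prints. */
static void
queue_file_for_indexer (GFile *file)
{
        gchar *uri = g_file_get_uri (file);
        g_print ("queue: %s\n", uri);
        g_free (uri);
}

static void
monitor_event_cb (GFileMonitor      *monitor,
                  GFile             *file,
                  GFile             *other_file,
                  GFileMonitorEvent  event,
                  gpointer           user_data)
{
        /* Monitor updates are forwarded to the indexer too */
        queue_file_for_indexer (file);
}

static void
crawl_directory (GFile *dir)
{
        GFileEnumerator *enumerator;
        GFileMonitor    *monitor;
        GFileInfo       *info;

        /* Watch the directory for later changes */
        monitor = g_file_monitor_directory (dir, G_FILE_MONITOR_NONE,
                                            NULL, NULL);
        if (monitor) {
                g_signal_connect (monitor, "changed",
                                  G_CALLBACK (monitor_event_cb), NULL);
        }

        enumerator = g_file_enumerate_children (dir,
                                                G_FILE_ATTRIBUTE_STANDARD_NAME ","
                                                G_FILE_ATTRIBUTE_STANDARD_TYPE,
                                                G_FILE_QUERY_INFO_NONE,
                                                NULL, NULL);
        if (!enumerator)
                return;

        while ((info = g_file_enumerator_next_file (enumerator, NULL, NULL))) {
                GFile *child;

                child = g_file_get_child (dir, g_file_info_get_name (info));

                if (g_file_info_get_file_type (info) == G_FILE_TYPE_DIRECTORY) {
                        crawl_directory (child);  /* recurse */
                } else {
                        queue_file_for_indexer (child);
                }

                g_object_unref (child);
                g_object_unref (info);
        }

        g_object_unref (enumerator);
}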


For non-inotify systems, I take it we will not use GIO? Do we have
enough info to manage things this way?

The daemon should also be able to get/set tags and user/app metadata,
not just the indexer.

I think we need a libtrackermetadata to share the metadata stuff between
indexer and daemon



The Indexer
===========

The indexer process works like a state machine with 3 queues for:

- Files
- Directories
- Modules

The files queue has the highest priority. Individual files are stored
here, waiting for metadata extraction, etc. Files are taken one by one
to be processed; when this queue is empty, a single token from the
_next_ queue is processed.

The directories queue is the _next_ queue. Directories wait here for
inspection. When a directory is checked, the contained files and
directories are prepended to their respective queues. When this queue
is empty, a single token from the _next_ queue is processed.

The last queue, and again the _next_ queue after the directories queue,
is the modules queue. When all files from the previous module have been
inspected, the next module does its part, and this continues until all
modules are finished. At this point the indexer quits. It should be
noted here that the indexer is an impermanent entity: it only survives
to process the work given to it.
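
In rough C, the three-queue draining looks something like the sketch
below (illustrative only - the real code in tracker-indexer differs in
detail):

#include <gio/gio.h>

/* Illustrative sketch of the indexer's queue priorities. */
typedef struct {
        GQueue *file_queue;     /* highest priority: individual files    */
        GQueue *dir_queue;      /* next: directories awaiting inspection */
        GQueue *module_queue;   /* last: modules still to be started     */
} Indexer;

/* Process exactly one item per iteration, always draining the
 * highest-priority non-empty queue first. Returns FALSE when all
 * queues are empty, at which point the indexer process can exit. */
static gboolean
indexer_process_next (Indexer *indexer)
{
        if (!g_queue_is_empty (indexer->file_queue)) {
                GFile *file = g_queue_pop_head (indexer->file_queue);
                /* extract metadata, update the database ... */
                g_object_unref (file);
        } else if (!g_queue_is_empty (indexer->dir_queue)) {
                GFile *dir = g_queue_pop_head (indexer->dir_queue);
                /* enumerate: prepend children to the file/dir queues ... */
                g_object_unref (dir);
        } else if (!g_queue_is_empty (indexer->module_queue)) {
                gchar *module = g_queue_pop_head (indexer->module_queue);
                /* ask the module for its directories and queue them ... */
                g_free (module);
        } else {
                return FALSE;   /* nothing left: the indexer quits */
        }

        return TRUE;
}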


Not quite what I had in mind - the indexer should be dumb and fed stuff
to index by the daemon. The exception is directories, which need to be
recursively scanned (I'm not sure we need separate queues for them).





The Problem
===========

The question is, should the daemon do some of this work? The issue here
for the daemon is that what it does is highly specific to "files" only.
It doesn't know anything about instant messaging files, locations, what
should be ignored, what should be monitored, etc.

When running the indexer right now, it sits at about 25-33% CPU in the
background indexing files (on my laptop); on my desktop, it can index
my 140k files in about 130 seconds using no throttling, and the system
is very usable during this time (and we haven't optimised anything yet
either). The daemon, however, does absolutely nothing after the initial
10-15 seconds (which is how long it takes to set up 6500 monitors and
get all 140k files in my home directory, 30k of which were ignored as
unsuitable). So the statistics look good, but the daemon can do more
and should be doing things like monitoring the desktop file directory
so we know when applications are added, removed or updated.


Absolutely - all watching should be done by the daemon.

The indexer should be told what service is being indexed when it's
passed a URL. The daemon will know this as it keeps track of which
directories belong to which service.
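
So each chunk handed from the daemon to the indexer would carry the
service/module name alongside the paths - something like this, purely
as an illustration (not the actual D-Bus interface):

#include <glib.h>

/* Hypothetical shape of one chunk sent daemon -> indexer. The daemon
 * knows which directories belong to which service, so it tags each
 * chunk with the module/service name instead of making the indexer
 * guess from the URL alone. */
typedef struct {
        gchar *module_name;   /* e.g. "files", "applications" */
        GStrv  paths;         /* NULL-terminated list of paths or URIs */
} IndexerChunk;

/* Hypothetical indexer-side entry point */
static void
indexer_process_chunk (IndexerChunk *chunk)
{
        guint i;

        for (i = 0; chunk->paths[i] != NULL; i++) {
                g_debug ("Indexing %s with module %s",
                         chunk->paths[i], chunk->module_name);
                /* hand the path to the right module for extraction ... */
        }
}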


To do this, we have been thinking about how best to design the
indexer/daemon workload so it is most efficient.


The How
=======

So after speaking with Carlos some more about this, the basic idea we
had was to make the indexer JUST index.

That was my plan :)


To do this, the modules need to be shared, so that the indexer can get
each module to index files the way it knows how, and so the daemon can
request the locations to monitor and crawl. The idea is that the daemon
crawls the files and sends all files and directories (we currently
don't send directories, just files) to the indexer. The indexer needs
both files and directories to add them to the database.

We can take this one step further. We can even have the daemon check in
the database before sending files to the indexer, to make sure we are
not generating extra work unnecessarily. This is something we don't do
at all yet, but it is planned.
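
The check could be as simple as comparing the mtime stored in the
database with what GIO reports on disk - a sketch, with
db_get_stored_mtime() standing in for the real database lookup:

#include <gio/gio.h>

/* Hypothetical lookup into the index database; always returns
 * "unknown" in this sketch. */
static guint64
db_get_stored_mtime (const gchar *path)
{
        /* would query the database for the stored mtime of @path */
        return 0;
}

/* Decide whether a file actually needs to go to the indexer. */
static gboolean
file_needs_indexing (GFile *file)
{
        GFileInfo *info;
        gchar     *path;
        guint64    disk_mtime, stored_mtime;
        gboolean   needs_indexing;

        info = g_file_query_info (file,
                                  G_FILE_ATTRIBUTE_TIME_MODIFIED,
                                  G_FILE_QUERY_INFO_NONE,
                                  NULL, NULL);
        if (!info)
                return FALSE;   /* can't stat it, nothing to send */

        path = g_file_get_path (file);
        disk_mtime = g_file_info_get_attribute_uint64 (info,
                                                       G_FILE_ATTRIBUTE_TIME_MODIFIED);
        stored_mtime = db_get_stored_mtime (path);

        /* Unknown to the database, or changed since it was indexed */
        needs_indexing = (stored_mtime == 0 || disk_mtime > stored_mtime);

        g_free (path);
        g_object_unref (info);

        return needs_indexing;
}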

Makes sense.



The Conclusion
==============

This work is mostly done right now. It is merely a case of moving the
architecture around a bit and moving code between processes. But is
this the right approach? What do you think? Comments welcome!


I think you are on the right track (I have not checked all the source
changes though - just a quick scan)

The main problem between the indexer and the daemon is which should
handle recursive indexing of folders (particularly during the
first-time index) - this can be done either way. I don't know which is
better.

For performance reasons the daemon should not be niced at all, so it's
important it does not consume too much CPU or disk I/O, whilst the
indexer does all the I/O- and CPU-intensive stuff and will be niced +19
and ioniced as much as possible.
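
For reference, dropping the indexer's own CPU priority from inside the
process is just a setpriority() call; the I/O side is the ioprio_set()
syscall (what ionice(1) wraps), which has no glibc wrapper so it's left
out of this sketch:

#include <sys/resource.h>
#include <stdio.h>

/* Drop the calling process's CPU priority as far as possible (+19).
 * I/O priority would be lowered similarly via the ioprio_set()
 * syscall, which is what ionice(1) uses. */
static void
lower_scheduling_priority (void)
{
        if (setpriority (PRIO_PROCESS, 0, 19) != 0) {
                perror ("setpriority");
        }
}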

Handling file moves also needs some discussion about whether the
indexer or the daemon should do this.

I'm open to whichever way you go here - there may also be issues with
NFS, with slow performance or broken file locking, so we need to be
careful.

Architecturally you just need a libtrackermetadata for the common
metadata routines between the indexer and the daemon (unless these are
somewhere else?).



jamie



