Re: [Tracker] [PATCH] "Daemonize" metadata extractor

From: Jamie McCracken <jamiemcc blueyonder co uk>
To: Carlos Garnacho <carlos imendio com>
Cc: tracker-list gnome org
Subject: Re: [Tracker] [PATCH] "Daemonize" metadata extractor
Date: Tue, 04 Mar 2008 10:36:37 -0500


On Tue, 2008-03-04 at 15:19 +0100, Carlos Garnacho wrote:

Hi!,

On Fri, 2008-02-29 at 09:10 -0500, Jamie McCracken wrote:

On Fri, 2008-02-29 at 11:34 +0100, Carlos Garnacho wrote:

Hi!,

I've attached a patch in bug #519337 to keep the extractor alive between
operations. This greatly improves performance, as it avoids having to
spawn/initialize the extractor constantly for each new file. With the
patch, the extractor shuts down by itself after 30 seconds of
inactivity. Any testing is appreciated.
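
A minimal sketch of the idle-shutdown idea, not the code from bug #519337: a long-lived worker keeps serving requests and returns once no request has arrived for a given period. The function name `serve_until_idle`, the newline-delimited request format, and the use of `select` on a POSIX pipe are all assumptions made for illustration.

```python
import os
import select


def serve_until_idle(read_fd, handle, idle_timeout=30.0):
    """Serve newline-terminated requests from read_fd.

    Returns the number of requests handled once idle_timeout seconds
    pass without input, or once the other end closes the pipe.
    (Hypothetical helper; assumes a POSIX file descriptor.)
    """
    handled = 0
    buf = b""
    while True:
        # Wait for input, but give up after idle_timeout seconds.
        ready, _, _ = select.select([read_fd], [], [], idle_timeout)
        if not ready:
            return handled  # idle for too long: shut down
        chunk = os.read(read_fd, 4096)
        if not chunk:
            return handled  # writer closed the pipe
        buf += chunk
        # Process every complete line received so far.
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            handle(line.decode())
            handled += 1
```

The process start-up cost is then paid once per burst of files rather than once per file, which is where the speed-up described above would come from.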

Besides, I've been thinking a bit about this subject. Right now trackerd
waits synchronously for the metadata extractor output (and the same
happens for thumbnailing, even when such data isn't immediately
necessary), so only 1 file is processed at a time.

Has there been any thinking/work on making that parallelizable? I'm sure
there'd be performance improvements if there was a pool of extractors
which asynchronously processed a queue of filenames.


yeah, although it's tricky with threads (synchronisation and deadlock
issues)


I didn't plan to use threads here. I've developed a small test extractor
[1] that spawns several extractors and manages them asynchronously
through watches. It requires the patched tracker-extractor from bug
#519337. You can run it with:

./test-extract [num-extractors] [path-to-extract]

Being a test, it just gets metadata from mp3 files, but the
tracker-extractor-pool.[ch] files can be easily adapted to tracker
needs.
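
As a rough illustration of the thread-free approach (this is not the tracker-extractor-pool.[ch] code, which is not shown in this message), the sketch below spawns several child processes and multiplexes their output from a single thread. Python's `selectors` module stands in for the watches mentioned above; `WORKER_SOURCE`, `extract_all`, and the tab-separated reply format are hypothetical, and the stand-in worker only echoes the path length instead of real metadata. It assumes a POSIX system and paths without tabs or newlines.

```python
import selectors
import subprocess
import sys
from collections import deque

# Stand-in worker: reads one path per line, prints one result line back.
# A real extractor would emit metadata instead of the path length.
WORKER_SOURCE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    path = line.strip()\n"
    "    print(path + '\\tlength=' + str(len(path)), flush=True)\n"
)


def extract_all(paths, num_workers=4):
    """Feed paths to child processes; return {path: result} for all of them."""
    pending = deque(paths)
    results = {}
    sel = selectors.DefaultSelector()

    def feed(proc):
        # Hand the worker its next path, or close stdin so it exits.
        if pending:
            proc.stdin.write(pending.popleft() + "\n")
            proc.stdin.flush()
        else:
            proc.stdin.close()

    for _ in range(min(num_workers, len(pending))):
        proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            text=True, bufsize=1,
        )
        sel.register(proc.stdout, selectors.EVENT_READ, proc)
        feed(proc)

    # Single-threaded loop: react to whichever worker has output ready.
    # Each worker has at most one request in flight, so one readline()
    # per readiness event never leaves a reply stuck in a buffer.
    while sel.get_map():
        for key, _events in sel.select():
            proc = key.data
            line = proc.stdout.readline()
            if not line:  # EOF: the worker has exited
                sel.unregister(proc.stdout)
                proc.stdout.close()
                proc.wait()
                continue
            path, _sep, metadata = line.rstrip("\n").partition("\t")
            results[path] = metadata
            feed(proc)
    sel.close()
    return results
```

Calling `extract_all(list_of_paths, num_workers=2)` keeps two workers busy until the queue is empty. Each result arrives complete, so the caller can save a file's metadata in one step.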


bear in mind tracker is a differential indexer, so when indexing a new
file we need all the metadata before saving it - we must not index
partially and then complete later on, as that's inefficient with our
design and would prolong the sqlite transactions, which would prevent
searches from running

It would make things a lot more complex unless I have misunderstood your
plans?

because we want to index lots of docs within an sqlite transaction, it's
likely we won't use threads for indexing in any event (as the threads
would block each other, since sqlite blocks reads and writes from others
when in a transaction). We could get round this by only having one thread
do the saving to sqlite, but it adds more complexity and more potential
memory usage from queueing up the docs with their metadata
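
To make the single-writer option concrete, here is a hedged sketch rather than anything taken from trackerd: several extractor threads push complete (path, metadata) pairs onto a bounded queue, and one thread owns the sqlite connection and commits them in batches. The names `index_documents` and `extract`, the `docs` table layout, and the batch size are assumptions for illustration. Bounding the queue is one way to cap the memory used by queued docs.

```python
import queue
import sqlite3
import threading


def index_documents(db_path, paths, extract, num_extractors=4, batch_size=100):
    """Extract metadata in several threads, save it from exactly one.

    `extract` is a caller-supplied function path -> metadata string
    (hypothetical). Returns the number of rows written.
    """
    # Bounded queue: extractors block when the writer falls behind,
    # so queued docs cannot grow without limit.
    docs = queue.Queue(maxsize=2 * batch_size)
    written = []

    def writer():
        # The connection is created and used by this thread only.
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS docs (path TEXT PRIMARY KEY, metadata TEXT)"
        )
        batch = []
        while True:
            item = docs.get()
            if item is not None:
                batch.append(item)
            if batch and (item is None or len(batch) >= batch_size):
                with conn:  # one transaction per batch
                    conn.executemany(
                        "INSERT OR REPLACE INTO docs VALUES (?, ?)", batch
                    )
                written.append(len(batch))
                batch = []
            if item is None:  # sentinel: no more docs will arrive
                break
        conn.close()

    def extractor(chunk):
        for path in chunk:
            # Only complete records are queued, never partial ones.
            docs.put((path, extract(path)))

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()
    extractors = [
        threading.Thread(target=extractor, args=(paths[i::num_extractors],))
        for i in range(num_extractors)
    ]
    for t in extractors:
        t.start()
    for t in extractors:
        t.join()
    docs.put(None)  # tell the writer to flush and stop
    writer_thread.join()
    return sum(written)
```

The extra moving parts (queue, sentinel, batching) are the added complexity mentioned above. The trade-off is that no two threads ever contend for the database.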


<snip>


anyway, to cut a long story short, daemonizing tracker-extract is not
the way to go but rather to embed common and reliable (e.g. not crash
prone) formats in a tracker-file-indexer daemon. It should use dbus of
course for flexibility. It could be threaded as it would be less complex
than trackerd is at the moment


What would be the criteria for marking an extractor as reliable? I'd be
extra careful there, as extractors deal with unknown data. Also, threading
brings other complexities, like the underlying libraries not being
thread-safe, or extractors that resort to command-line calls that are not
thread-aware at all, etc.


AFAIK gstreamer is thread-safe, and probably only music files would be
done in-process

all other formats would have to be done out of process although it would
be nice to do images in-process (as music and image files are the most
likely to be present in large numbers)

jamie

Follow-Ups:
- Re: [Tracker] [PATCH] "Daemonize" metadata extractor
  - From: Carlos Garnacho

References:
- Re: [Tracker] [PATCH] "Daemonize" metadata extractor
  - From: Carlos Garnacho
