Re: Interesting Post on Tracker

From: Joe Shaw <joeshaw novell com>
To: Kevin Kubasik <kevin kubasik net>
Cc: dashboard-hackers <dashboard-hackers gnome org>
Subject: Re: Interesting Post on Tracker
Date: Wed, 11 Oct 2006 15:58:05 -0400

Hi,

On Wed, 2006-10-04 at 22:13 -0400, Kevin Kubasik wrote:
> I saw this post syndicated on Planet Gnome and thought that it deserves
> some attention. While it doesn't directly attack beagle, its 'points' do
> seem to hit a little close to home on beagles weaker (er) spots, a good
> read.

Yes, the digs are thinly veiled.

> I do encourage comment and opinions back to the list (as opposed to on
> his blog) so we can try to learn from tracker and see what works and
> doesn't.

After I read this, I decided to give Tracker another try.  I could never
get it to build when it was using mysql, and it still didn't build out
of the box for me, but I was able to hack a few things and get it going.

There are a number of things in the post which seem to be exaggerated:

        * Mail indexing doesn't work at all; I got errors whenever I
        tried to turn on Evolution mail indexing.  Looking at the code,
        the current implementation is far too naive and if it did work
        wouldn't be useful.  The names of the mailboxes are currently
        hardcoded to (for Evo) "Inbox" and "Sent" and there is no
        support for anything but mboxes.  There isn't any logic to map a
        message to an Evolution-understandable URI, so it is not
        possible to open mail hits.  There is a lot of work to be done
        in this area.

        * I didn't strictly measure indexing time, but it didn't feel
        any faster indexing my data than Beagle does.  Until Tracker has
        more coverage of indexable data, this probably isn't a relevant
        or fair comparison.
        
        * The memory usage is great, but it's not at the 3 MB level.
        While indexing for me, it seemed to hover around the 7-9 MB
        level.  In any case, still quite a bit better than Beagle.
        
        * The API suggests that you can't search both the sqlite DB and
        the text index at the same time, which means that implementation
        details are pushed out onto the user, or at least onto a savvy
        programmer.  It doesn't seem possible to search for "eggplant
        veggie" where "eggplant" is in the text content and "veggie" is
        external metadata like a tag.
        
        * The Pango word breaking he references is commented out as
        being too slow.  Lucene already handles CJK word breaking.
        
        * The only stemmer provided is English.  The stemmer uses the
        same well-known Porter stemming algorithm that is already used
        inside Lucene.  Also, the license of the snowball stemmer
        appears to be old-style BSD so it would be incompatible with GPL
        applications.
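
To make the "eggplant veggie" case above concrete, here is a minimal
sketch of a combined search in Python.  It is not Tracker's or Beagle's
API: the in-memory dictionaries, the example URIs and the search()
function are all hypothetical stand-ins for a text index and a
metadata (tag) store.  The idea is only that each term may match in
either place, and the caller never has to know which.

```python
# Hypothetical stand-ins: term -> set of URIs.
# text_index plays the role of the full-text index,
# tags plays the role of external metadata such as user tags.
text_index = {
    "eggplant": {"file:///home/joe/recipes.txt", "file:///home/joe/garden.txt"},
    "veggie": {"file:///home/joe/shopping.txt"},
}
tags = {
    "veggie": {"file:///home/joe/recipes.txt", "file:///home/joe/photo.jpg"},
}


def search(terms):
    """Return URIs for which every term matches in text content OR tags."""
    result = None
    for term in terms:
        # A term matches a document if it appears in either store.
        matches = text_index.get(term, set()) | tags.get(term, set())
        # All terms must match, so intersect across terms.
        result = matches if result is None else result & matches
    return result or set()


print(search(["eggplant", "veggie"]))
```

With this sample data the only match is recipes.txt, where "eggplant"
is in the text and "veggie" is a tag.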
        
Other notes:

        * Using QDBM as the text indexer is an interesting idea.  It is
        a lot lower-level than Lucene and probably would not be
        well-suited to Beagle's use because we store documents rather
        than just an ID to look up in a database.  The ability to search
        both text and metadata makes a move to this system inefficient.
        It may make more sense to switch to a Lucene-like library such
        as Ferret, which is written in C and purportedly gives a
        performance improvement.
        
        * The benchmarks cited about QDBM are revised in a followup
        article, and the slowness of Lucene is often found to be due to
        JVM warmup time.
        
        * Tracker is really well optimized for returning URIs.  The
        Beagle search APIs return a full "Hit" object which contains all
        the metadata for a document.  In certain cases you just want a
        URI and we should probably expose an API for that, which will be
        substantially faster.
        
        * The low-level components in Lucene are pretty well-tested
        upstream in both the Java and .Net versions.  From a Beagle
        standpoint, however, we could do well to have comprehensive test
        suites.  We have some testing tools, but the whole area could use
        a lot of improvement.  For example, version 0.2.9 shipped with a
        nasty bug in which removal notifications weren't being sent to
        clients.  Despite my test runs, the tools didn't catch this.
        
        * Tracker still has quite a few bugs; the daemon would often
        just die in the middle of indexing with no error message.  It
        never made it fully through.
        
        * Tracker uses a lot of CPU.  I have a dual-CPU box so Tracker's
        CPU usage was often above 100% and was pretty consistently at
        70%.  If it has throttling like Beagle, it doesn't work nearly
        as well.  On the other hand, I didn't have any documents that
        caused it to spin at 100% CPU like Beagle sometimes does.
        
        * My system got progressively slower as Tracker indexed.  I
        didn't investigate this, but when I returned to my machine after
        letting it index for a while, my system was noticeably slower; I
        was logging memory usage while it was running, however, and it
        never seemed to get out of control.  Not sure what is going on
        there.
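
As a rough illustration of the URI-only API idea above, here is a
small Python sketch.  It is not the actual Beagle API: the Hit class,
the DOCUMENTS and INDEX tables and both function names are
hypothetical.  It only shows the shape of the two paths, one that
returns bare URIs and one that also loads the metadata for every
match.

```python
from dataclasses import dataclass, field

# Hypothetical metadata store: URI -> properties.
DOCUMENTS = {
    "file:///home/joe/a.txt": {"title": "A", "mime": "text/plain"},
    "file:///home/joe/b.txt": {"title": "B", "mime": "text/plain"},
}

# Hypothetical index: term -> list of matching URIs.
INDEX = {"beagle": ["file:///home/joe/a.txt", "file:///home/joe/b.txt"]}


@dataclass
class Hit:
    """A full result: the URI plus all metadata for the document."""
    uri: str
    properties: dict = field(default_factory=dict)


def search_uris(term):
    """Cheap path: return matching URIs without touching metadata."""
    return list(INDEX.get(term, []))


def search_hits(term):
    """Full path: look up and copy the metadata for each match."""
    return [Hit(uri, dict(DOCUMENTS[uri])) for uri in search_uris(term)]


print(search_uris("beagle"))
print(search_hits("beagle")[0].properties["title"])
```

A client that only needs to open or count results would call the
first function and skip the per-document metadata lookup entirely.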
        
Anyway, that's my rundown of things.  Basically Beagle's tasks are
unchanged: we need to rework the indexing to better handle user-supplied
metadata, we need to consolidate indexes into a fixed number rather than
one-per-backend to help reduce memory usage, and we need to focus on
fixing bugs in filters and backends so that our indexing process is more
robust.

We had a hackfest at the Boston GNOME summit with myself, Fredrik, Bera,
Daniel, and others.  I'll send a follow-up email about that.

Thanks,
Joe