Re: Interesting Post on Tracker
- From: Joe Shaw <joeshaw novell com>
- To: Kevin Kubasik <kevin kubasik net>
- Cc: dashboard-hackers <dashboard-hackers gnome org>
- Subject: Re: Interesting Post on Tracker
- Date: Wed, 11 Oct 2006 15:58:05 -0400
Hi,
On Wed, 2006-10-04 at 22:13 -0400, Kevin Kubasik wrote:
> I saw this post syndicated on Planet Gnome and thought that it deserves
> some attention. While it doesn't directly attack beagle, its 'points' do
> seem to hit a little close to home on beagles weaker (er) spots, a good
> read.
Yes, the digs are thinly veiled.
> I do encourage comment and opinions back to the list (as opposed to on
> his blog) so we can try to learn from tracker and see what works and
> doesn't.
After I read this, I decided to give Tracker another try. I could never
get it to build when it was using mysql, and it still didn't build out
of the box for me, but I was able to hack a few things and get it going.
There are a number of things in the post which seem to be exaggerated:
* Mail indexing doesn't work at all; I got errors whenever I
tried to turn on Evolution mail indexing. Looking at the code,
the current implementation is far too naive and if it did work
wouldn't be useful. The names of the mailboxes are currently
hardcoded to (for Evo) "Inbox" and "Sent" and there is no
support for anything but mboxes. There isn't any logic to map a
message to an Evolution-understandable URI, so it is not
possible to open mail hits. There is a lot of work to be done
in this area.
* I didn't strictly measure indexing time, but it didn't feel
any faster indexing my data than Beagle does. Until Tracker has
more coverage of indexable data, this probably isn't a relevant
or fair comparison.
* The memory usage is great, but it's not at the 3 MB level.
While indexing for me, it seemed to hover around the 7-9 MB
level. In any case, that is still quite a bit better than Beagle.
* The API suggests that you can't search both the sqlite DB and
the text index at the same time, which means that implementation
details are pushed out onto the user, or at least onto a savvy
programmer. It doesn't seem possible to search for "eggplant
veggie" where "eggplant" is in the text content and "veggie" is
external metadata like a tag.
* The Pango word breaking he references is commented out as
being too slow. Lucene already handles CJK word breaking.
* The only stemmer provided is English. The stemmer uses the
same well-known Porter stemming algorithm that is already used
inside Lucene. Also, the license of the snowball stemmer
appears to be old-style BSD so it would be incompatible with GPL
applications.
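To make the "eggplant veggie" case above concrete, here is a minimal Python sketch of a query that matches each term against either the text content or the external metadata. It is only an illustration: ToyIndex and its methods are hypothetical names, and it uses neither Tracker's nor Beagle's actual API or storage.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical in-memory stand-in for a text index plus a tag store."""

    def __init__(self):
        self.text_terms = defaultdict(set)  # word -> URIs whose content has it
        self.tags = defaultdict(set)        # tag  -> URIs carrying that tag

    def add(self, uri, content, tags=()):
        for word in content.lower().split():
            self.text_terms[word].add(uri)
        for tag in tags:
            self.tags[tag.lower()].add(uri)

    def search(self, query):
        """Return URIs where every query term matches the content OR a tag."""
        result = None
        for term in query.lower().split():
            # Union of both sources for this term, then AND across terms.
            matches = self.text_terms.get(term, set()) | self.tags.get(term, set())
            result = matches if result is None else result & matches
        return result or set()


index = ToyIndex()
index.add("file:///tmp/recipes.txt", "roasted eggplant with garlic", tags=["veggie"])
index.add("file:///tmp/notes.txt", "eggplant emoji list", tags=["misc"])
hits = index.search("eggplant veggie")
```

Here only recipes.txt matches, because "eggplant" is found in its content and "veggie" in its tags. The caller never has to know which store answered which term.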
Other notes:
* Using QDBM as the text indexer is an interesting idea. It is
a lot lower-level than Lucene and probably would not be
well-suited to Beagle's use because we store documents rather
than just an ID to look up in a database. Because Beagle needs to
search both text and metadata, a move to this system would be
inefficient. It may make more sense to switch to a Lucene-like
library such as Ferret, which is written in C and purportedly
gives a performance improvement.
* The benchmarks cited about QDBM are revised in a followup
article, and the slowness of Lucene is often found to be due to
JVM warmup time.
* Tracker is really well optimized for returning URIs. The
Beagle search APIs return a full "Hit" object which contains all
the metadata for a document. In certain cases you just want a
URI and we should probably expose an API for that, which will be
substantially faster.
* The low-level components in Lucene are pretty well-tested
upstream in both the Java and .Net versions. From a Beagle
standpoint, however, we could do well to have comprehensive test
suites. We have some testing tools, but the whole area could use
a lot of improvement. For example, version 0.2.9 shipped with a
nasty bug in which removal notifications weren't being sent to
clients. Despite my test runs, the tools didn't catch this.
* There are still quite a few bugs; the daemon would often die in
the middle of indexing with no error message. It never made it
fully through.
* Tracker uses a lot of CPU. I have a dual-CPU box so tracker's
CPU usage was often above 100% and was pretty consistently at
70%. If it has throttling like Beagle, it doesn't work nearly
as well. On the other hand, I didn't have any documents that
caused it to spin at 100% CPU like Beagle sometimes does.
* My system got progressively slower as Tracker indexed; it was
noticeably sluggish when I returned to my machine after letting
it index for a while. I didn't investigate this, but I was
logging memory usage while it was running and it never seemed to
get out of control, so I'm not sure what is going on there.
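As a rough illustration of the URI-only idea in the notes above, the following Python sketch contrasts a search that returns bare URIs with one that builds a full hit object per result. All names here (ToyStore, Hit, search_uris, search_hits) are hypothetical and are not Beagle's API; the metadata_loads counter only stands in for the extra per-document work, and the sketch makes no claim about real-world timings.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """Hypothetical full result: a URI plus all of its metadata."""
    uri: str
    properties: dict = field(default_factory=dict)


class ToyStore:
    def __init__(self):
        self._matches = {}       # term -> list of matching URIs
        self._metadata = {}      # uri  -> metadata dict
        self.metadata_loads = 0  # counts per-document metadata fetches

    def add(self, uri, terms, properties):
        self._metadata[uri] = dict(properties)
        for term in terms:
            self._matches.setdefault(term, []).append(uri)

    def search_uris(self, term):
        """Cheap path: return matching URIs without touching metadata."""
        return list(self._matches.get(term, []))

    def search_hits(self, term):
        """Full path: load the metadata for every match and wrap it in a Hit."""
        hits = []
        for uri in self.search_uris(term):
            self.metadata_loads += 1
            hits.append(Hit(uri, dict(self._metadata[uri])))
        return hits


store = ToyStore()
store.add("file:///tmp/a.txt", ["eggplant"], {"mimetype": "text/plain"})
store.add("file:///tmp/b.txt", ["eggplant"], {"mimetype": "text/plain"})
uris = store.search_uris("eggplant")   # no metadata loaded
full = store.search_hits("eggplant")   # one metadata load per match
```

A client that only needs to open or list results could use the first call and skip the per-document metadata work entirely.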
Anyway, that's my rundown of things. Basically Beagle's tasks are
unchanged: we need to rework the indexing to better handle user-supplied
metadata, we need to consolidate indexes into a fixed number rather than
one-per-backend to help reduce memory usage, and we need to focus on
fixing bugs in filters and backends so that our indexing process is more
robust.
We had a hackfest at the Boston GNOME summit with me, Fredrik, Bera,
Daniel, and others. I'll send a follow-up email about that.
Thanks,
Joe