State of the Pooch
- From: Joe Shaw <joeshaw novell com>
- To: dashboard-hackers gnome org
- Subject: State of the Pooch
- Date: Thu, 16 Nov 2006 15:53:37 -0500
Hi,
I can't believe it's been 18 months since the last State of the Pooch
email:
http://mail.gnome.org/archives/dashboard-hackers/2005-May/msg00011.html
It's fun to go back and reread it to see all of the stuff we've
accomplished in that time. A follow-up has been far too long in coming.
Anyway, the purpose of this mail is to fill everyone in on the stuff I
and others are doing, and hopefully call to action people who are
interested in hacking on Beagle but don't know where to start.
* Unified indexes
This is a big project I have been working on for the last couple
of weeks. The gist of it is that Beagle today uses two Lucene
indexes for every backend, and we now have (by my count) 17
backends. This is a waste of disk space and memory, and slows
down overall search performance. Moreover, these indexes have a
very uneven number of items (with many having zero), which also
slows down search performance on the bigger ones.
This work will result in a fixed number of Lucene indexes
regardless of how many backends there are, with a relatively
even distribution of documents among them.
This work is currently being done on the
beagle-unified-indexes-branch in CVS. I can go into a lot more
detail on this if people are interested.
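To make the "fixed number of indexes, even distribution" idea
concrete, here is a minimal sketch in Python (Beagle itself is C#).
It assumes documents are assigned to an index by hashing their URI.
That assignment rule, the index count of 4, and the names index_for
and distribution are all hypothetical; the mail does not say how the
branch actually distributes documents.

```python
import hashlib
from collections import Counter

NUM_INDEXES = 4  # hypothetical fixed count, independent of the number of backends


def index_for(uri, num_indexes=NUM_INDEXES):
    """Map a document URI to one of a fixed number of indexes.

    A stable hash is used so the same URI always lands in the same index.
    """
    digest = hashlib.sha1(uri.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_indexes


def distribution(uris, num_indexes=NUM_INDEXES):
    """Count how many documents each index would hold."""
    counts = Counter(index_for(u, num_indexes) for u in uris)
    return [counts.get(i, 0) for i in range(num_indexes)]


if __name__ == "__main__":
    # One large source and one tiny source share the same indexes.
    uris = [f"file:///home/user/doc{i}.txt" for i in range(1000)]
    uris += [f"email://inbox/{i}" for i in range(10)]
    print(distribution(uris))
```

With a scheme like this, a backend with ten items and one with a
thousand share the same indexes, so no index sits nearly empty.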
* Memory usage
The other big thing I've been working on is reducing memory
usage. I've posted here and blogged about it some in the past,
and it continues to be the biggest issue in Beagle and its
adoption thus far. Fortunately there is a new Mono profiler
out, called heap-shot:
http://primates.ximian.com/~lluis/blog/pivot/entry.php?id=56
This and heap-buddy are invaluable tools. I've already
identified a few more "hotspots" that we can improve.
* Generics and .NET 2.0
Somewhat related to the memory usage, we will probably be
switching to using Mono's .NET 2.0 class libraries soon and
starting to integrate generics into Beagle code. This is
because Mono 1.1.18 declared the generics compiler stable, and a
move to generics will also help reduce our memory usage. In
addition, many of the new 2.0 classes are more efficient than
their 1.x counterparts.
* Showing status on the state of the index
One common problem we hear about on IRC (and sometimes on-list)
is that people search for something but can't find it because
Beagle hasn't indexed it yet and gives no indication that the
initial indexing is still in progress. There is some
infrastructure for this in place now, but only the Evolution
mail backend uses it. This will be fleshed out more (especially
for files), so that the UI makes it clear to users that the
initial indexing process has not yet finished.
* Automatic document language detection
Paul Betts is working on code that will allow Beagle to
automatically detect what language a document is in, so that we
can do proper analysis on that document. Right now we assume
everything is English, and apply English rules for stemming.
This will allow for us to search for documents based on language
and handle language-specific search terms.
Paul tells me he has most of the detection code finished and now
needs to hook it up into Beagle. We'll also probably need to
bring in the Snowball stemmers to handle the document language
correctly.
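As a rough illustration of per-document language detection feeding
the choice of stemmer, here is a naive Python sketch based on
counting common stopwords. This is not Paul's code and says nothing
about how his detection works; the word lists and the
detect_language function are hypothetical and far too small for real
use.

```python
# Tiny, hypothetical stopword lists; a real detector would use much more data.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "german": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu"},
    "french": {"le", "la", "et", "les", "des", "est", "un", "une"},
}


def detect_language(text, default="english"):
    """Guess a language by counting stopword hits; fall back to the default."""
    words = text.lower().split()
    scores = {
        lang: sum(1 for w in words if w in stopwords)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # With no hits at all, keep today's behaviour and assume English.
    return best if scores[best] > 0 else default


if __name__ == "__main__":
    lang = detect_language("der Hund ist nicht zu Hause und die Katze")
    print(lang)  # the detected language would then select the matching stemmer
```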
* Networked searches
Fredrik started the work of integrating Kyle and Alexis's Summer
of Code work on the networked searches during the GNOME summit
and I know he's made good progress on it. I'm hoping this email
will guilt him into finishing that work or at least giving us a
status update on that. :)
* Spelling suggestions
This summer Fredrik also did a proof of concept implementation
for giving spelling suggestions on searches. He opened a
bugzilla bug about it and attached his work here:
http://bugzilla.gnome.org/show_bug.cgi?id=353534
and you can see a screenshot of it in action here:
http://bugzilla.gnome.org/attachment.cgi?id=72008&action=view
Fred highlighted a few problems with his implementation and
Kevin also pointed out some issues he had. It would be great if
someone interested in this took this project on.
* Handling crashes in the index helper better
We have a problem right now with certain files -- usually
Microsoft Word -- crashing the index helper process. Because
Beagle is incredibly conservative about corrupting the index,
after this happens we purge the index and start reindexing.
Obviously this sucks if you have one of those crashy documents.
We've tried to push these issues upstream to the wv1 developers,
but the bugs basically have been ignored, so an upstream
solution doesn't seem forthcoming.
A corrupt index in this case is extremely unlikely, so what we
should probably do instead is not purge the
index and be smarter about detecting a crash so that when we
push a batch of files from the daemon to the helper process, we
can identify the crashy file, mark it, and move on. Yes, the
helper will still crash -- we can't avoid that -- but we will
become more robust to those problematic files.
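The "identify the crashy file, mark it, and move on" approach could
look roughly like the Python sketch below. It assumes the helper
records which file it is about to process before touching it, so
that after a crash the daemon can tell which file was in flight.
That journal, the HelperCrashed exception (standing in for the helper
process dying), and the function names are all hypothetical, not
Beagle's actual design.

```python
class HelperCrashed(Exception):
    """Stands in for the helper process dying; it carries no details."""


def run_helper(batch, journal, index, crashes_on):
    """Simulated helper: note each file in the journal, then index it."""
    for path in batch:
        journal["current"] = path  # written before the file is touched
        if path in crashes_on:
            raise HelperCrashed()  # simulated crash on a problematic file
        index.add(path)
    journal["current"] = None


def index_batch(batch, crashes_on=frozenset()):
    """Daemon side: on a crash, mark the in-flight file and resume after it."""
    index, marked = set(), set()
    journal = {"current": None}  # a real journal would live on disk
    pending = list(batch)
    while pending:
        try:
            run_helper(pending, journal, index, crashes_on)
            pending = []
        except HelperCrashed:
            bad = journal["current"]
            marked.add(bad)  # skip this file from now on
            # Keep the existing index and continue with the rest of the batch.
            pending = pending[pending.index(bad) + 1:]
    return index, marked


if __name__ == "__main__":
    files = ["a.txt", "b.doc", "c.txt", "d.doc", "e.txt"]
    indexed, marked = index_batch(files, crashes_on={"b.doc", "d.doc"})
    print(sorted(indexed), sorted(marked))
```

The helper still "crashes" once per bad file, but the index is kept
and every other file in the batch gets indexed.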
* Removable media
Beagle needs to support indexing of data on removable media.
There isn't any support for this right now. I don't really have
in-depth details about this, but it's on the radar and (sadly)
pretty far down on the TODO.
* Thunderbird memory usage
The Thunderbird backend is a bit of a hog right now. The bug is:
http://bugzilla.gnome.org/show_bug.cgi?id=355549
Kevin has been doing some work on this, but we really need
people to take a look at this. I know that this backend has
been disabled by default in Fedora Core 6.
* The return of D-Bus
In the last State of the Pooch I talked about removing D-Bus
from Beagle due to its unsuitability for Beagle and the lack of
stability in the Mono bindings. Now that there is a completely
new, all-managed D-Bus implementation, we should revisit that
decision and consider adding a D-Bus search API.
A proof of concept on this would be very helpful, and could be
done as a completely standalone project. Essentially one could
write a proxy in C# which exposed a D-Bus search interface, took
the requests, and then used the C# Beagle APIs to run the search
and return the results. (Make sure to implement live queries!)
I think that's it! There is always work to be done in supporting new
file formats through filters and data sources through backends, as well
as improving our documentation on the Wiki. We've done a great job
since the last State of the Pooch, and while I hope it's not quite as
long until the next one, I know we can keep doing this great work together.
Thanks,
Joe