State of the Pooch

From: Joe Shaw <joeshaw novell com>
To: dashboard-hackers gnome org
Subject: State of the Pooch
Date: Thu, 16 Nov 2006 15:53:37 -0500

Hi,

I can't believe it's been 18 months since the last State of the Pooch
email:

        http://mail.gnome.org/archives/dashboard-hackers/2005-May/msg00011.html

It's fun to go back and reread it to see all of the stuff we've
accomplished in that time.  A follow-up has been far too long in coming.

Anyway, the purpose of this mail is to fill everyone in on the stuff I
and others are doing, and hopefully call to action people who are
interested in hacking on Beagle but don't know where to start.

* Unified indexes

        This is a big project I have been working on the last couple of
        weeks.  The gist of it is that Beagle today uses two Lucene
        indexes for every backend, and we now have (by my count) 17
        backends.  This is a waste of disk space and memory, and slows
        down overall search performance.  Moreover, these indexes have a
        very uneven number of items (with many having zero), which also
        slows down search performance on the bigger ones.
        
        This work will result in a fixed number of Lucene indexes
        regardless of how many backends there are, and have a relatively
        even distribution of documents contained within them.
        
        This work is currently being done on the
        beagle-unified-indexes-branch in CVS.  I can go into a lot more
        detail on this if people are interested.
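
        For anyone wondering how a fixed set of indexes could stay
        evenly filled: one simple possibility is to hash each
        document's URI into a bucket.  The sketch below is only an
        illustration in Python, not the code on the branch;
        NUM_INDEXES, index_for and the sample URIs are all made up.

```python
import hashlib
from collections import Counter

NUM_INDEXES = 4  # hypothetical fixed index count, not a Beagle setting


def index_for(uri: str, num_indexes: int = NUM_INDEXES) -> int:
    """Map a document URI to one of a fixed number of indexes.

    A stable hash (rather than Python's per-process randomized hash())
    keeps the mapping identical across runs, so the same document can
    be located again for updates and removals.
    """
    digest = hashlib.sha1(uri.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_indexes


if __name__ == "__main__":
    # Three pretend backends with very uneven document counts.
    uris = [f"file:///home/user/doc{i}.txt" for i in range(9000)]
    uris += [f"email://inbox/{i}" for i in range(900)]
    uris += [f"feed://blog/{i}" for i in range(100)]
    counts = Counter(index_for(u) for u in uris)
    for shard in sorted(counts):
        print(shard, counts[shard])
```

        The point is that the number of indexes depends only on
        NUM_INDEXES, not on how many backends exist, and that a
        backend with few (or zero) documents no longer gets an index
        of its own.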
        
* Memory usage

        The other big thing I've been working on is reducing memory
        usage.  I've posted here and blogged about it some in the past,
        and it continues to be the biggest issue in Beagle and its
        adoption thus far.  Fortunately there is a new Mono profiler
        out, called heap-shot:
        
                http://primates.ximian.com/~lluis/blog/pivot/entry.php?id=56
                
        This and heap-buddy are invaluable tools.  I've already
        identified a few more "hotspots" that we can improve.
        
* Generics and .NET 2.0

        Somewhat related to the memory usage, we will probably be
        switching to using Mono's .NET 2.0 class libraries soon and
        starting to integrate generics into Beagle code.  This is
        because Mono 1.1.18 declared the generics compiler stable, and a
        move to generics will also help reduce our memory usage.  In
        addition, many of the new 2.0 classes are more efficient than
        their 1.x counterparts.
        
* Showing status on the state of the index

        One common complaint we get on IRC (and sometimes on-list) is
        that people search for something but can't find it because
        Beagle hasn't indexed it yet and gives no indication that the
        initial indexing is still in progress.  There is some
        infrastructure for this in place now, but only the Evolution
        mail backend uses it.  This will be fleshed out more (especially
        for files), so that the UI makes it clear to users that the
        initial indexing process has not yet finished.
        
* Automatic document language detection

        Paul Betts is working on code that will allow Beagle to
        automatically detect what language a document is in, so that we
        can do proper analysis on that document.  Right now we assume
        everything is English, and apply English rules for stemming.
        
        This will allow us to search for documents based on language
        and handle language-specific search terms.
        
        Paul tells me he has most of the detection code finished but
        still needs to hook it up into Beagle.  We'll also probably
        need to bring in the Snowball stemmers to handle the document
        language correctly.
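
        To make the idea concrete, here is a toy Python sketch of
        "detect the language, then pick the matching analysis rules".
        It is not Paul's code, and real detectors are far more
        robust; the stopword lists and the detect_language name are
        invented for illustration.

```python
# Tiny, hand-picked stopword lists (illustrative only).
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "french": {"le", "la", "les", "et", "est", "une", "des", "que"},
}


def detect_language(text: str, default: str = "english") -> str:
    """Guess a language by counting known stopwords in the text.

    Falls back to `default` when no stopword matches, which mirrors
    the current behaviour of assuming everything is English.
    """
    words = text.lower().split()
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


if __name__ == "__main__":
    for sample in (
        "the cat is in the house",
        "der Hund ist nicht in dem Haus und die Katze",
        "le chat est sur la table et les chiens",
    ):
        # The detected name would select the stemmer to apply.
        print(detect_language(sample), "<-", sample)
```

        The detected language could then be stored with the document
        and used to choose the stemmer, instead of always applying
        the English rules.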
        
* Networked searches

        Fredrik started the work of integrating Kyle and Alexis's Summer
        of Code work on the networked searches during the GNOME summit
        and I know he's made good progress on it.  I'm hoping this email
        will guilt him into finishing that work or at least giving us a
        status update on that. :)
        
* Spelling suggestions

        This summer Fredrik also did a proof of concept implementation
        for giving spelling suggestions on searches.  He opened a
        bugzilla bug about it and attached his work here:
        
                http://bugzilla.gnome.org/show_bug.cgi?id=353534
                
        and you can see a screenshot of it in action here:
        
                http://bugzilla.gnome.org/attachment.cgi?id=72008&action=view
                
        Fred highlighted a few problems with his implementation and
        Kevin also pointed out some issues he had.  It would be great if
        someone interested in this took this project on.
        
* Handling crashes in the index helper better

        We have a problem right now with certain files -- usually
        Microsoft Word -- crashing the index helper process.  Because
        Beagle is incredibly conservative about corrupting the index,
        after this happens we purge the index and start reindexing.
        Obviously this sucks if you have one of those crashy documents.
        We've tried to push these issues upstream to the wv1 developers,
        but the bugs basically have been ignored, so an upstream
        solution doesn't seem forthcoming.
        
        The likelihood of a corrupt index in this case is extremely
        low, so what we should probably do instead is not purge the
        index and be smarter about detecting a crash so that when we
        push a batch of files from the daemon to the helper process, we
        can identify the crashy file, mark it, and move on.  Yes, the
        helper will still crash -- we can't avoid that -- but we will
        become more robust to those problematic files.
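
        Roughly, the "identify the crashy file, mark it, and move on"
        loop could look like the Python sketch below.  Everything in
        it is hypothetical: the START/DONE progress lines, the
        stand-in helper script and the function names are assumptions
        for illustration, not how the real daemon and helper talk to
        each other.

```python
import subprocess
import sys

# Stand-in helper: announces each file before touching it, then
# "indexes" it.  Any name containing "crashy" kills the process
# abruptly, simulating a filter crash.
HELPER_SOURCE = r"""
import os, sys
for name in sys.argv[1:]:
    print("START " + name, flush=True)
    if "crashy" in name:
        os._exit(139)
    print("DONE " + name, flush=True)
"""


def run_helper(batch):
    """Run one helper process over the batch.

    Returns (done, crashed_on): the files that finished, and the file
    that was in progress when the helper died (None on a clean exit).
    """
    proc = subprocess.run(
        [sys.executable, "-c", HELPER_SOURCE, *batch],
        capture_output=True, text=True,
    )
    done, in_progress = [], None
    for line in proc.stdout.splitlines():
        kind, _, name = line.partition(" ")
        if kind == "START":
            in_progress = name
        elif kind == "DONE":
            done.append(name)
            in_progress = None
    crashed_on = in_progress if proc.returncode != 0 else None
    return done, crashed_on


def index_batch(batch):
    """Index a batch, marking files that crash the helper and moving on.

    This sketch assumes the helper only ever dies while a file is in
    progress, as the stand-in helper above does.
    """
    indexed, marked = [], []
    remaining = list(batch)
    while remaining:
        done, crashed_on = run_helper(remaining)
        indexed.extend(done)
        if crashed_on is None:
            break
        marked.append(crashed_on)
        # Resume right after the crashy file; nothing is purged.
        remaining = remaining[len(done) + 1:]
    return indexed, marked


if __name__ == "__main__":
    print(index_batch(["a.txt", "crashy.doc", "b.txt"]))
```

        The helper still dies on the bad file, but the daemon only
        loses that one file rather than the whole index.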
        
* Removable media

        Beagle needs to support indexing of data on removable media.
        There isn't any support for this right now.  I don't really have
        in-depth details about this, but it's on the radar and (sadly)
        pretty far down on the TODO.
        
* Thunderbird memory usage

        The Thunderbird backend is a bit of a hog right now.  This is
        tracked in:
        
                http://bugzilla.gnome.org/show_bug.cgi?id=355549
                
        Kevin has been doing some work on this, but we really need
        people to take a look at this.  I know that this backend has
        been disabled by default in Fedora Core 6.
        
* The return of D-Bus

        In the last State of the Pooch I talked about removing D-Bus
        from Beagle due to its unsuitability for Beagle and the lack of
        stability in the Mono bindings.  Now that there is a completely
        new, all-managed D-Bus implementation, we should revisit that
        decision and consider adding a D-Bus search API.
        
        A proof of concept on this would be very helpful, and could be
        done as a completely standalone project.  Essentially one could
        write a proxy in C# which exposed a D-Bus search interface, took
        the requests, and then used the C# Beagle APIs to run the search
        and return the results.  (Make sure to implement live queries!)
        
I think that's it!  There is always work to be done in supporting new
file formats through filters and data sources through backends, as well
as improving our documentation on the Wiki.  We've done a great job
since the last State of the Pooch, and while I hope it's not quite as
long until the next one, I know we can keep up this great work together.

Thanks,
Joe