State of the Pooch

From: "Joe Shaw" <joe joeshaw org>
To: dashboard-hackers gnome org
Subject: State of the Pooch
Date: Thu, 25 Oct 2007 15:37:48 -0400
Hi,

It's that time again.  Time for a "State of the Pooch" email to let
the community know how we're doing with Beagle and where we're going.
Previous addresses are here:

    http://mail.gnome.org/archives/dashboard-hackers/2006-November/msg00064.html
    http://mail.gnome.org/archives/dashboard-hackers/2005-May/msg00011.html

A lot of the stuff in the previous SotP, roughly a year ago, still
applies in some way today.

* dBera is now co-maintainer

    I'm happy to announce that Debajyoti Bera, who has easily written
    more code for Beagle in the last year than anyone else, has become
    a co-maintainer of the project.  This is great news because he has
    solid knowledge of the codebase and is the first non-Novell
    maintainer of the code.

    dBera will still be mostly coding, and he and I will have equal
    say over patches, technical direction, etc.  He may also do
    releases from time to time. :)

* The never-ending quest for 0.3.0

    Work continues on making a great 0.3.0 release, and in the
    meantime we're pushing out 0.2.x maintenance releases.  I'd love
    it if people could regularly run from SVN trunk so that we can
    stress test a lot of the features that I'll mention below and get
    a 0.3.0 release out that less adventurous users can enjoy.

* Networked searches

    Thanks to work from Lukas Lipka and Fredrik Hedberg, we've
    (finally!) merged the network search code from Alexis and Kyle's
    Summer of Code projects from last year into the codebase.  The
    Beagle daemon now provides a backend which can query other Beagle
    instances.  There is some preliminary support for Avahi and
    autodiscovery of other Beagle daemons on the network, but that's
    currently disabled while some stability bugs are worked out.

    There's still a lot of work to be done here in terms of how we
    access non-file resources on remote machines, security concerns,
    etc., so this code should be considered experimental for now.  You
    can turn it on by toggling the networked setting in one of the
    configuration tools.

* Web user interface

    dBera and Nirbheek Chauhan have been working on a Web interface to
    Beagle.  In addition to search results, it exposes index
    information and daemon status, and lets you shut down the daemon.
    The Web UI relies on the network infrastructure.  It's not meant
    to be a replacement for beagle-search, but it is nice in that it
    is easily skinnable, will make it easy to view email in the
    browser, etc.

    http://beagle-project.org/Beagle_Webinterface

* New configuration system

    dBera has been working on a new configuration system to handle two
    shortcomings in the current system: (1) allowing a system-wide
    configuration file, so sysadmins can apply policy to all users, and
    (2) allowing plugins (like filters and backends) to store and
    retrieve their own configuration options.  The configuration
    manager loads the global config file (in /etc/beagle/config/) and
    the local one, merging the two.  This also fixes the current
    problem where all settings were saved in the user's config file,
    not just the ones that are changed from the default.
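
    The merge-and-save behavior described above could be sketched
    roughly as follows.  This is a minimal illustration in Python, not
    Beagle's actual (C#) code: the option names, the default values,
    and the precedence order (defaults, then global, then local) are
    all assumptions made for the example.

```python
# Hypothetical option names and defaults, for illustration only.
DEFAULTS = {"index_home": True, "networked": False, "max_snippet_length": 200}


def merge_config(defaults, global_cfg, local_cfg):
    """Layer the configs; later layers override earlier ones.

    Assumed precedence: defaults < global (system-wide) < local (per-user).
    """
    merged = dict(defaults)
    merged.update(global_cfg)
    merged.update(local_cfg)
    return merged


def settings_to_save(baseline, effective):
    """Return only the options whose value differs from the baseline."""
    return {key: value for key, value in effective.items()
            if baseline.get(key) != value}


global_cfg = {"index_home": False}   # policy set by the sysadmin
local_cfg = {"networked": True}      # the one option this user changed

effective = merge_config(DEFAULTS, global_cfg, local_cfg)
baseline = merge_config(DEFAULTS, global_cfg, {})

# Only the user's real changes end up in the per-user file.
to_save = settings_to_save(baseline, effective)
```

    The point of the second function is the fix mentioned above: the
    user's file holds only the settings that differ from what they
    would get anyway, rather than a full copy of every option.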

* Xesam support

    Arun Raghavan has written an adapter to Beagle which implements
    the Xesam freedesktop.org search spec, and the reference tools run
    against it.  Exactly how this will be integrated into the code is
    unclear at this point, however.  As of right now, there are no
    fully fledged search tools which use the Xesam API, so we're not
    ready to commit to the APIs natively.  Also, integrating D-Bus
    back into Beagle is a worthy goal, but will require quite a bit of
    work.

* Firefox extension

    More great Summer of Code work, the new Firefox extension has been
    merged into the source tree.  In addition to indexing web pages as
    you view them, you can now index web pages, links, and images on
    demand.  The settings UI is greatly improved as well.

    http://dtecht.blogspot.com/2007/08/hey-firefox-beagle-this-now.html

* Thunderbird extension

    In another SoC project, we decided to take a different approach from
    the previous Thunderbird work and the Evolution backend, and add
    support for Thunderbird through an extension.  This extension is
    responsible for sending emails to the running Beagle daemon for
    indexing.  While you have to be running Thunderbird for this to
    work, it's fast and much, much friendlier on the system resources.

* Experimental RDF branch

    This is an experimental branch which will export an RDF service
    that clients can query.  This is something that has been planned
    from the beginning in Beagle, but we've never gotten around to it
    until now.  As data is indexed, an RDF store will be created
    alongside the text index, and more complex relationships between
    the data can be examined.

* Lots of work to be done

    My list of things I would like to see get some attention:

    - Rewrite of the file system backend.  I've mentioned this on the
      list before, but I wanted to give a little more info.  When we
      designed the file system backend, we decided to largely separate
      files and folders from their file system hierarchy.  This
      allowed us to handle moves of any number of files
      underneath a folder instantaneously.  However, in doing so we
      had to trade off the ability to search for files underneath a
      given folder.  In retrospect, I think this was the wrong
      decision.  In addition to adding a ton of complexity to the
      code, it has a major negative effect on memory usage and
      prohibits users from doing an extremely common type of search.

      I feel that the file system backend has to be rewritten much more
      simply, with the parent-child relationship of files indexed and
      easily searchable.  This will make large moves inefficient, but
      will make a more common use case possible.  (And moving large
      numbers of files is what I call a "thundering herd" problem, and
      one that has to be dealt with anyway, because things like "rm
      -rf" already trigger them.)

    - D-Bus back in Beagle as the primary message system.  I wrote the
      current serialized XML format a couple of years ago now and
      while it's served us well, I think that junking that code and
      switching back to D-Bus is the right thing to do.  D-Bus has
      matured and stabilized considerably, and we now have a totally
      native C# implementation of the protocol.  In the end, I think
      it will be quite a bit faster than the automatic XML parsing
      that happens today.

    - Removable media.  It came up again fairly recently on the list,
      but I'd like to see some sort of integration of Beagle with HAL
      so that many removable devices can be indexed automatically, and
      make it possible to retrieve information about files from
      offline storage like CDs.

    - Test suites.  We had a Novell-internal test suite for many file
      formats for a while, but the majority of those files I couldn't
      distribute.  We're gradually building up a good set of files in
      SVN to test, but we really need people to start writing test
      harnesses for those files and regression tests for individual
      subsystems.  This work will help stability and development
      tremendously.
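
    To make the file system backend trade-off from the first item
    concrete, here is a toy Python sketch of the "simpler" direction:
    each entry keeps its place in the hierarchy, so searching under a
    folder is a plain prefix match, while moving a folder has to touch
    every descendant.  The class, its methods, and the paths are
    hypothetical; this is not a design for the real backend.

```python
class PathIndex:
    """Toy index that stores each file's full path."""

    def __init__(self):
        self.paths = set()

    def add(self, path):
        self.paths.add(path)

    def search_under(self, folder):
        """Return every indexed file below `folder`, at any depth."""
        prefix = folder.rstrip("/") + "/"
        return sorted(p for p in self.paths if p.startswith(prefix))

    def move_folder(self, old, new):
        """Rewrite every descendant entry; return how many were touched."""
        old_prefix = old.rstrip("/") + "/"
        new_prefix = new.rstrip("/") + "/"
        moved = {p for p in self.paths if p.startswith(old_prefix)}
        self.paths -= moved
        self.paths |= {new_prefix + p[len(old_prefix):] for p in moved}
        return len(moved)


index = PathIndex()
index.add("/home/joe/docs/a.txt")
index.add("/home/joe/docs/sub/b.pdf")
index.add("/home/joe/music/c.mp3")

under_docs = index.search_under("/home/joe/docs")
touched = index.move_folder("/home/joe/docs", "/home/joe/archive")
under_archive = index.search_under("/home/joe/archive")
```

    The cost of a move grows with the number of files below the folder,
    which is exactly the inefficiency accepted above in exchange for
    the more common "search under this folder" case.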

* Miscellaneous other niceties

    - Reworking of child indexables (e.g., a PDF inside a ZIP inside
      an email): These are faster and use less memory than before.

    - Taglib-sharp: Use this, an actively developed and maintained
      library, for extracting metadata from audio files.

    - Snowball analyzers: The first step toward language based
      indexing.

    - Sqlite3 and Mono.Data.Sqlite: In 0.3.0 we will support only
      sqlite version 3, and use the upstream, maintained Mono APIs for
      this, which should greatly reduce bugs.

    - Nautilus metadata: Emblems, notes, and other metadata that are
      set through GNOME's Nautilus file manager are now indexed.  This
      was a proof of concept implementation for how to extract
      metadata from external sources; there is also an API for this
      that F-Spot uses.

    - TeX filter: One of the most oft-requested features.

    - TextCache: We were wasting TONS of disk space with the way
      things were laid out before.  Thanks to dBera and Arun, we now
      have a hybrid file system and database system for much more
      optimal storage of text data from complex files.

      http://dtecht.blogspot.com/2007/10/i-saved-80mb.html

    - Snippets: The gross way of getting HTML snippets back is fixed.
      You can now request the size of the snippet you want and get
      structured data back so that it's easier to present and doesn't
      require an HTML widget or a regexp to transform the output.  You
      also now get the line of the file that the snippet is on, the
      sentence before the match, and the sentence after the match.

    - New query API to retrieve metadata about a particular URI,
      including the complete cached text rather than a snippet.
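
    As a rough picture of what "structured data" for snippets could
    look like to a client, here is a small Python sketch.  The field
    names, the function, and the naive sentence splitting are
    assumptions for illustration and do not reflect Beagle's real API
    or wire format.

```python
import re
from dataclasses import dataclass


@dataclass
class Snippet:
    line: int     # 1-based line of the file where the matching sentence starts
    before: str   # sentence before the match ("" if there is none)
    match: str    # matching sentence, cut to the requested size
    after: str    # sentence after the match ("" if there is none)


def find_snippet(text, term, max_length=80):
    """Return a Snippet for the first sentence containing `term`, or None."""
    # Naive split: a sentence is a run of text ending in '.', '!' or '?'.
    sentences = []
    for m in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = m.group()
        if raw.strip():
            # Offset of the first non-whitespace character of the sentence.
            offset = m.start() + len(raw) - len(raw.lstrip())
            sentences.append((offset, raw.strip()))

    for i, (offset, sentence) in enumerate(sentences):
        if term.lower() in sentence.lower():
            return Snippet(
                line=text.count("\n", 0, offset) + 1,
                before=sentences[i - 1][1] if i > 0 else "",
                match=sentence[:max_length],
                after=sentences[i + 1][1] if i + 1 < len(sentences) else "",
            )
    return None


text = ("Beagle indexes files.\n"
        "It also indexes email. Snippets are structured now.\n"
        "The end.")
snippet = find_snippet(text, "email")
```

    Because the result is plain fields rather than HTML, a client can
    lay it out however it likes without an HTML widget or a regexp.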

I think that's most of the big stuff!  As I always do, I am sure I
forgot something.  But hopefully it won't be another 11 months before
the next one of these emails.  Your attention and help are
appreciated!

Thanks,
Joe