State of the Pooch
- From: "Joe Shaw" <joe joeshaw org>
- To: dashboard-hackers gnome org
- Subject: State of the Pooch
- Date: Thu, 25 Oct 2007 15:37:48 -0400
Hi,
It's that time again. Time for a "State of the Pooch" email to let
the community know how we're doing with Beagle and where we're going.
Previous addresses are here:
http://mail.gnome.org/archives/dashboard-hackers/2006-November/msg00064.html
http://mail.gnome.org/archives/dashboard-hackers/2005-May/msg00011.html
A lot of the stuff in the previous SotP, roughly a year ago, still
applies in some way today.
* dBera is now co-maintainer
I'm happy to announce that Debajyoti Bera, who has easily written
more code for Beagle in the last year than anyone else, has become
a co-maintainer of the project. This is great news because he has
solid knowledge of the codebase and is the first non-Novell
maintainer of the code.
dBera will still be mostly coding, and he will share final say with
me on patches, technical direction, etc. He may also do releases
from time to time. :)
* The never-ending quest for 0.3.0
Work continues on making a great 0.3.0 release, and in the
meantime we're pushing out 0.2.x maintenance releases. I'd love
it if people could regularly run from SVN trunk so that we
can stress test a lot of the features that I'll mention below and
get a 0.3.0 release out that less adventurous users can enjoy.
* Networked searches
Thanks to work from Lukas Lipka and Fredrik Hedberg, we've
(finally!) merged the network search code from Alexis and Kyle's
Summer of Code projects from last year into the codebase. The
Beagle daemon now provides a backend which can query other Beagle
instances. There is some preliminary support for Avahi and
autodiscovery of other Beagle daemons on the network, but that's
currently disabled while some stability bugs are worked out.
There's still a lot of work to be done here in terms of how we
access non-file resources on remote machines, security concerns,
etc., so this code should be considered experimental for now. You
can turn it on by toggling the networked setting in one of the
configuration tools.
* Web user interface
dBera and Nirbheek Chauhan have been working on a Web interface to
Beagle. In addition to search results, this UI provides index
information, daemon status, and the ability to shut down the
daemon. The Web UI relies on the network infrastructure.
It's not meant to be a replacement for beagle-search, but it is
nice in that it is easily skinnable, will make it easy to view
email in the browser, etc.
http://beagle-project.org/Beagle_Webinterface
* New configuration system
dBera has been working on a new configuration system to add two
things the current system lacks: (1) a system-wide configuration
file, so sysadmins can apply policy to all users, and (2) a way
for plugins (like filters and backends) to store and
retrieve their own configuration options. The configuration
manager loads the global config file (in /etc/beagle/config/) and
the local one, merging the two. This also fixes the current
problem where all settings were saved in the user's config file,
not just the ones that are changed from the default.
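To make the merge concrete, here is a minimal Python sketch (not
Beagle's actual C# code). The option names and default values are
hypothetical, and it assumes the user's file overrides the
system-wide one, which the description above doesn't specify.

```python
# Hypothetical option names and defaults, for illustration only.
DEFAULTS = {"index_home": True, "networked": False, "snippet_length": 200}


def effective_config(global_cfg, local_cfg):
    """Layer the settings: defaults, then the global file, then the user's file."""
    merged = dict(DEFAULTS)
    merged.update(global_cfg)   # system-wide policy
    merged.update(local_cfg)    # assumption: the user's file wins on conflict
    return merged


def settings_to_save(user_settings, global_cfg):
    """Return only the settings that differ from what the user would inherit."""
    inherited = dict(DEFAULTS)
    inherited.update(global_cfg)
    return {
        key: value
        for key, value in user_settings.items()
        if key not in inherited or inherited[key] != value
    }


global_cfg = {"networked": True}
local_cfg = {"snippet_length": 100}
current = effective_config(global_cfg, local_cfg)
to_save = settings_to_save(current, global_cfg)
```

Here to_save holds only snippet_length, since the other two values
already come from the defaults or the global file.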
* Xesam support
Arun Raghavan has written an adapter to Beagle which implements
the Xesam freedesktop.org search spec, and the reference tools run
against it. Exactly how this will be integrated into the code is
unclear at this point, however. As of right now, there are no
fully fledged search tools which use the Xesam API, so we're not
ready to commit to the APIs natively. Also, integrating D-Bus
back into Beagle is a worthy goal, but will require quite a bit of
work.
* Firefox extension
More great Summer of Code work: the new Firefox extension has been
merged into the source tree. In addition to indexing web pages as
you view them, you can now index web pages, links, and images on
demand. The settings UI is greatly improved as well.
http://dtecht.blogspot.com/2007/08/hey-firefox-beagle-this-now.html
* Thunderbird extension
In another SoC project, we decided to take a different approach from
the previous Thunderbird work and the Evolution backend, and add
support for Thunderbird through an extension. This extension is
responsible for sending emails to the running Beagle daemon for
indexing. While you have to be running Thunderbird for this to
work, it's fast and much, much friendlier on the system resources.
* Experimental RDF branch
This is an experimental branch which will export an RDF service
that clients can query. This is something that has been planned
from the beginning in Beagle, but we've never gotten around to it
until now. As data is indexed, an RDF store will be created
alongside the text index, and more complex relationships between
the data can be examined.
* Lots of work to be done
My list of things I would like to see get some attention:
- Rewrite of the file system backend. I've mentioned this on the
list before, but I wanted to give a little more info. When we
designed the file system backend, we decided to largely separate
files and folders from their file system hierarchy. This
allowed us to handle moves of an infinite number of files
underneath a folder instantaneously. However, in doing so we
had to trade off the ability to search for files underneath a
given folder. In retrospect, I think this was the wrong
decision. In addition to adding a ton of complexity to the
code, it has a major negative effect on memory usage and
prohibits users from doing an extremely common type of search.
I feel that the file system backend has to be rewritten much more
simply, with the parent-child relationship of files indexed and
easily searchable. This will make large moves inefficient, but
will make a more common use case possible. (Moving large
numbers of files is what I call a "thundering herd" problem, and
one that has to be dealt with anyway, because things like "rm
-rf" already trigger it.)
- D-Bus back in Beagle as the primary message system. I wrote the
current serialized XML format a couple of years ago now and
while it's served us well, I think that junking that code and
switching back to D-Bus is the right thing to do. D-Bus has
matured and stabilized considerably, and we now have a totally
native C# implementation of the protocol. In the end, I think
it will be quite a bit faster than the automatic XML parsing
that happens today.
- Removable media. It came up again fairly recently on the list,
but I'd like to see some sort of integration of Beagle with HAL
so that many removable devices can be indexed automatically and
information about files can be retrieved from offline storage
like CDs.
- Test suites. We had a Novell-internal test suite for many file
formats for a while, but the majority of those files I couldn't
distribute. We're gradually building up a good set of files in
SVN to test, but we really need people to start writing test
harnesses for those files and regression tests for individual
subsystems. This work will help stability and development tremendously.
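To illustrate the trade-off in the file system backend rewrite
described above, here is a small Python sketch. It is not the
planned design, just one way to index the parent-child
relationship: every file is recorded under each of its ancestor
folders. The class and method names are made up for this example,
and folder paths are assumed to have no trailing slash.

```python
import posixpath
from collections import defaultdict


class HierarchyIndex:
    """Toy index mapping each folder to every file anywhere beneath it."""

    def __init__(self):
        self.under = defaultdict(set)

    @staticmethod
    def _ancestors(path):
        # Yield every folder above path, up to and including the root.
        parent = posixpath.dirname(path)
        while True:
            yield parent
            if parent in ("/", ""):
                break
            parent = posixpath.dirname(parent)

    def add(self, path):
        for folder in self._ancestors(path):
            self.under[folder].add(path)

    def remove(self, path):
        for folder in self._ancestors(path):
            self.under[folder].discard(path)

    def search_under(self, folder):
        # A single lookup answers "which files are under this folder?"
        return sorted(self.under.get(folder, ()))

    def move_folder(self, old, new):
        # Every file below the folder has to be re-indexed, which is the
        # cost of a large move in this scheme.
        moved = self.search_under(old)
        for path in moved:
            self.remove(path)
            self.add(new + path[len(old):])
        return len(moved)


idx = HierarchyIndex()
idx.add("/home/joe/docs/a.txt")
idx.add("/home/joe/docs/sub/b.txt")
idx.add("/home/joe/music/c.ogg")
```

Searching under a folder is cheap here, while move_folder has to
touch every file below the moved folder, which is the inefficiency
mentioned above.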
* Miscellaneous other niceties.
- Reworking of child indexables (e.g., PDF inside a ZIP inside an
email): These are faster and use less memory than before.
- Taglib-sharp: We use this actively developed and maintained
library for extracting metadata from audio files.
- Snowball analyzers: The first step toward language-based
indexing.
- Sqlite3 and Mono.Data.Sqlite: In 0.3.0 we will support only
sqlite version 3, and use the upstream, maintained Mono APIs for
this, which should greatly reduce bugs.
- Nautilus metadata: Emblems, notes, and other metadata that are
set through GNOME's Nautilus file manager are now indexed. This
was a proof of concept implementation for how to extract
metadata from external sources; there is also an API for this
that F-Spot uses.
- TeX filter: One of the most oft-requested features.
- TextCache: We were wasting TONS of disk space with the way
things were laid out before. Thanks to dBera and Arun, we now
have a hybrid file system and database system for much more
optimal storage of text data from complex files.
http://dtecht.blogspot.com/2007/10/i-saved-80mb.html
- Snippets: The gross way of getting HTML snippets back is fixed.
You can now request the size of the snippet you want and get
structured data back so that it's easier to present and doesn't
require an HTML widget or a regexp to transform the output. You
also now get the line of the file that the snippet is on, the
sentence before the match, and the sentence after the match.
- New query API to retrieve metadata about a particular URI,
including the complete cached text rather than a snippet.
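As a rough picture of what a structured snippet could look like,
here is a Python sketch. The Snippet fields and the find_snippet
function are hypothetical and are not Beagle's API; they just
mirror the pieces listed in the Snippets item above (a requested
size, the line of the match, and the sentences before and after).

```python
import re
from dataclasses import dataclass

# Naive sentence splitter: text up to and including ., ! or ?
SENTENCE = re.compile(r"[^.!?]+[.!?]*")


@dataclass
class Snippet:
    line: int     # 1-based line number where the match starts
    before: str   # sentence before the match ("" if none)
    match: str    # sentence containing the search term
    after: str    # sentence after the match ("" if none)


def find_snippet(text, term, max_length=200):
    """Return a structured snippet for the first sentence containing term."""
    spans = [m for m in SENTENCE.finditer(text) if m.group().strip()]

    def clean(i):
        # Collapse whitespace and trim to the requested size.
        if 0 <= i < len(spans):
            return " ".join(spans[i].group().split())[:max_length]
        return ""

    for i, m in enumerate(spans):
        offset = m.group().lower().find(term.lower())
        if offset >= 0:
            line = text.count("\n", 0, m.start() + offset) + 1
            return Snippet(line, clean(i - 1), clean(i), clean(i + 1))
    return None


text = "Beagle indexes files.\nIt also indexes email.\nSearch is fast."
snippet = find_snippet(text, "email")
```

A client can lay out snippet.before, snippet.match, and
snippet.after however it likes, with no HTML widget or regexp
needed on its side.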
I think that's most of the big stuff! As I always do, I am sure I
forgot something. But hopefully it won't be another 11 months before
the next one of these emails. Your attention and help are
appreciated!
Thanks,
Joe