Mail work



Hi,

I meant to send this out over the holiday, sorry it has taken so long.
It's long-winded, so if you just care about the consequences, scroll
toward the end.

As you probably read in Jon's 0.0.3 announcement, we had a meeting two
weeks ago to try to determine a good plan-of-attack for the project.  We
came up with a set of milestones, with the goal of shipping Beagle in
SUSE 9.3.

Without a doubt, the largest, heaviest, and slowest part of Beagle thus
far has been the mail indexer.  And if you're anything like me, you have
much more email than any other kind of data, and accessing it quickly is
probably the most important thing when it comes to search.  To give you
some perspective on my own volume, I have 171,103 messages in a few
different mailboxes as of this writing, most of them from this year.  In
addition to that, I have 105,390 messages in my spam folder, although
that dates back to late 2002.  So I definitely understand the CPU and
memory issues that many of you with smaller email loads have been
experiencing.

Jon, Dave and I -- mostly Jon -- have done a lot of work to try to
reduce the resources needed to effectively index and search such a large
body of data, and I think we're getting to a point where things are
truly usable.  

We've made the decision that natively indexing the data, rather than
relying on outside sources (like Camel's indexes) will give us the best
results.  In this specific case, we get the following benefits:

        * Results out of Lucene are just plain faster than out of
          Camel.  Also, because of the batching work I did a couple of
          weeks ago, Lucene hits come out in batches of 200 hits, which
          Camel can't do.  For searches which return thousands of
          results, this makes a significant difference in perception of
          speed.
        
        * We get significantly more accurate relevancy information.  The
          Camel index doesn't have any relevancy at all, so everything
          came out with a default relevancy of 1.0, which is extremely
          high, and then decayed based on when the email was sent.
        
        * All the metadata information for a mail is stored in the index
          with the mail itself.  The Camel index only gave us UIDs for
          the results, and not any additional metadata like the sender,
          subject, etc.  Since we were storing that metadata in a
          Lucene index, we had to cross-reference every Camel hit with
          a Lucene search anyway, which hurt us a lot when there were
          many results.
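
To illustrate the batching point in the first item above, here is a
minimal sketch in Python (not Beagle's actual code, which lives in the
Beagle tree).  The function name batched_hits and the generic "hits"
iterable are made up for illustration; only the batch size of 200 comes
from the description above.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched_hits(hits: Iterable[T], batch_size: int = 200) -> Iterator[List[T]]:
    """Yield hits in lists of at most batch_size items.

    Hypothetical helper: the consumer can start displaying the first
    batch while later batches are still being pulled from the source.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(hits)
    while True:
        # Pull up to batch_size items without consuming the whole source.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Stand-in for a query that matches 450 messages.
    for batch in batched_hits(range(450)):
        print(len(batch))
```

The idea is only that a front end can render each batch as it arrives
instead of waiting for thousands of results to be collected first.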
        
So now we're going to index the mbox files directly in Lucene and
ignore the Evolution summaries altogether for local mail.  For IMAP,
the summaries are our only source of data, so we'll continue to use
them there and index any cached messages on disk.
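
For anyone curious what "indexing the mbox files directly" looks like
in practice, here is a rough sketch.  It uses Python's standard mailbox
module as a stand-in for the GMime bindings, so it is not the code that
is landing; the function name iter_mbox_documents and the field names
in the returned dict are assumptions made for this example.

```python
import mailbox
from typing import Dict, Iterator


def iter_mbox_documents(path: str) -> Iterator[Dict[str, str]]:
    """Walk an mbox file and yield one dict of indexable fields per message.

    Hypothetical helper: the metadata (sender, subject, date) travels
    with the body text, so no separate summary lookup is needed.
    """
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            body_parts = []
            for part in msg.walk():
                # Only plain-text parts are collected in this sketch.
                if part.get_content_type() != "text/plain":
                    continue
                payload = part.get_payload(decode=True)
                if not payload:
                    continue
                charset = part.get_content_charset() or "utf-8"
                try:
                    text = payload.decode(charset, errors="replace")
                except LookupError:
                    # Unknown charset name: fall back to UTF-8.
                    text = payload.decode("utf-8", errors="replace")
                body_parts.append(text)
            yield {
                "sender": str(msg.get("From", "")),
                "subject": str(msg.get("Subject", "")),
                "date": str(msg.get("Date", "")),
                "body": "\n".join(body_parts),
            }
    finally:
        box.close()
```

Each yielded dict would then be handed to whatever adds documents to the
index; that step is left out here on purpose.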

[ Here are the consequences... ]

I've been working on this new code over the past two weeks and I'm ready
to land it.  IT MEANS A NEW DEPENDENCY.  I wrote some mono bindings for
the GMime library, which helps us parse the mail files and MIME parts,
and we're going to be using those.  Chris is working on building
snapshots for it now, and I plan on landing the code shortly after he's
finished with those.  There hasn't been a release with the mono bindings
yet, so if you want to build from scratch, you'll need to check out
"gmime" from GNOME CVS.  Fortunately, it's very simple to build.  Make
sure the mono bindings are enabled after running autogen.sh/configure.

Another thing we're looking into is an Evolution e-plugin to help reduce
some of the processing power needed to use the summaries when new mail
comes in, a mailbox is expunged, etc.  If people are interested in
hacking on that or are curious about the specifics, I can go into more
detail in a separate email.

Thanks,
Joe



