Re: Mail work



Hi Joe,

Great mail.  Thanks for keeping us outsiders in the loop.

Regarding relevancy information: what other information will indexing
the mail directly get us?

If you are going to avoid using the Evo summary files altogether, how do
you plan to get information like read/unread, flagged, replied-to, etc.
into Lucene?  The e-plugin?

-Alex


On Mon, 2004-11-29 at 13:10 -0500, Joe Shaw wrote:
> Hi,
> 
> I meant to send this out over the holiday, sorry it has taken so long.
> It's long-winded, so if you just care about the consequences, scroll
> toward the end.
> 
> As you probably read in Jon's 0.0.3 announcement, we had a meeting two
> weeks ago to try to determine a good plan-of-attack for the project.  We
> came up with a set of milestones, with the goal of shipping Beagle in
> SUSE 9.3.
> 
> Without a doubt, the largest, heaviest, and slowest part of Beagle thus
> far has been the mail indexer.  And if you're anything like me, you have
> much more email than any other kind of data, and accessing it quickly is
> probably the most important thing when it comes to search.  To give you
> some perspective on my own volume, I have 171,103 messages in a few
> different mailboxes as of this writing, most of them from this year.  In
> addition to that, I have 105,390 messages in my spam folder, although
> that dates back to late 2002.  So I definitely understand the CPU and
> memory issues that many of you with smaller email loads have been
> experiencing.
> 
> Jon, Dave and I -- mostly Jon -- have done a lot of work to try to
> reduce the resources needed to effectively index and search such a large
> body of data, and I think we're getting to a point where things are
> truly usable.  
> 
> We've made the decision that natively indexing the data, rather than
> relying on outside sources (like Camel's indexes), will give us the best
> results.  In this specific case, we get the following benefits:
> 
>         * Results out of Lucene are just plain faster than out of
>           Camel.  Also, because of the batching work I did a couple of
>           weeks ago, Lucene hits come out in batches of 200 hits, which
>           Camel can't do.  For searches which return thousands of
>           results, this makes a significant difference in perception of
>           speed.
>         
>         * We get significantly more accurate relevancy information.  The
>           Camel index doesn't have any relevancy at all, so everything
>           came out with a default relevancy of 1.0, which is extremely
>           high, and then decayed based on when the email was sent.
>         
>         * All the metadata information for a mail is stored in the index
>           with the mail itself.  The Camel index only gave us UIDs for
>           the results, and not any additional metadata like the sender,
>           subject, etc.  Since we were storing this information in a
>           Lucene index, we had to cross-reference this data with a
>           search anyway.  This also hurt us a lot with many results.
>         
> So now we're going to index the mbox files directly in Lucene and
> ignore the Evolution summaries altogether for local mail.  Since the
> summaries are our only source of data for IMAP, we'll continue to use
> them there and index any cached messages on disk.
> 
> [ Here are the consequences... ]
> 
> I've been working on this new code over the past two weeks and I'm ready
> to land it.  IT MEANS A NEW DEPENDENCY.  I did some Mono bindings for
> the GMime library, which helps us parse the mail files and MIME parts,
> and we're going to be using those.  Chris is working on building
> snapshots for it now, and I plan on landing the code shortly after he's
> finished with those.  There hasn't been a release with the Mono bindings
> yet, so if you want to build from scratch, you'll need to check out
> "gmime" from GNOME CVS.  Fortunately, it's very simple to build.  Make
> sure the Mono bindings are enabled after you run autogen.sh/configure.
> 
> Another thing we're looking into is an Evolution e-plugin to help reduce
> some of the processing power needed to use the summaries when new mail
> comes in, a mailbox is expunged, etc.  If people are interested in
> hacking on that or are curious about the specifics, I can go into more
> detail in a separate email.
> 
> Thanks,
> Joe
> 
> _______________________________________________
> Dashboard-hackers mailing list
> Dashboard-hackers gnome org
> http://mail.gnome.org/mailman/listinfo/dashboard-hackers
> 
> 
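
For readers who want a concrete picture of the batching Joe mentions
(hits delivered 200 at a time rather than all at once), here is a
minimal sketch in Python.  It is not Beagle's code: the function name
batched_hits and the BATCH_SIZE constant are made up for illustration,
and only the figure of 200 comes from the mail above.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Batch size taken from the mail; everything else here is hypothetical.
BATCH_SIZE = 200


def batched_hits(hits: Iterable[T], size: int = BATCH_SIZE) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` hits from `hits`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(hits)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

A consumer can display the first batch as soon as it arrives instead of
waiting for thousands of results, which is presumably where the
perceived speed-up comes from.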

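The idea of indexing mbox files directly, with each message's metadata
kept next to its text, can be sketched roughly as follows.  This uses
Python's standard mailbox module in place of GMime and plain
dictionaries in place of Lucene documents, so it only illustrates the
shape of the data; the names mbox_documents and first_text_part are
hypothetical.

```python
import mailbox


def first_text_part(msg) -> str:
    """Return the first text/plain part of a message as a string."""
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name in the message; fall back to UTF-8.
            return payload.decode("utf-8", errors="replace")
    return ""


def mbox_documents(path: str):
    """Yield one dict per message: metadata and body stored together."""
    box = mailbox.mbox(path, create=False)
    try:
        for key, msg in box.iteritems():
            yield {
                "key": key,
                "from": msg.get("From", ""),
                "subject": msg.get("Subject", ""),
                "date": msg.get("Date", ""),
                "body": first_text_part(msg),
            }
    finally:
        box.close()
```

Because sender, subject and date travel with the body, a search hit
already carries what is needed to display it, with no second lookup.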


