Re: [Bug 323065] Thunderbird Backend

From: Pierre Östlund <pierre ostlund gmail com>
To: Kevin Kubasik <kevin kubasik net>
Cc: Dashboard-hackers gnome org
Subject: Re: [Bug 323065] Thunderbird Backend
Date: Sun, 16 Apr 2006 19:49:28 +0200

Kevin Kubasik wrote:

Ok, I've been playing with Pierre's code and it's quite similar to what
I did (which was mostly copy and paste Evolution stuff). Where does
the TODO stand at the moment before this backend's inclusion into CVS?
It seems that most of Pierre's concerns are WONTFIX's for one reason
or another. Is it worth getting this out into the 'public' a bit for
some more feedback? If not, what exactly do we need to accomplish?
(I'd love to help however I can).

There are basically three things that need much attention:
 1. The imap uri
 2. Automatic re-indexing of files when needed
 3. Thunderbird

1. The imap uri seems to be different depending on which version of thunderbird is being used or depending on some imap server setting. Don't know which yet. My uri looks something like this:


imap://<server>:<port>/fetch>UID>.INBOX><message id>

But Michal's uri looks like this:

imap://<server>:<port>/fetch>UID>/INBOX><message id>

Why does his uri have slashes while mine has dots? Anyone know anything about this?
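
Until the cause is known, a consumer of these uris could accept both forms. The following is a minimal Python sketch, not code from the backend: the layout is inferred only from the two examples above, and the names `IMAP_URI`, `parse_imap_uri` and `same_message` are hypothetical. It assumes the character after `UID>` is also the separator between folder levels, which this thread does not confirm.

```python
import re

# Assumed layout, inferred from the two examples above:
#   imap://<server>:<port>/fetch>UID><delim><folder>><message id>
# where <delim> is "." in one uri and "/" in the other.
IMAP_URI = re.compile(
    r"^imap://(?P<server>[^:/]+):(?P<port>\d+)/fetch>UID>"
    r"(?P<delim>[./])(?P<folder>[^>]+)>(?P<msgid>[^>]+)$"
)


def parse_imap_uri(uri):
    """Split a message uri into its parts; return None if it does not match."""
    m = IMAP_URI.match(uri)
    if m is None:
        return None
    return {
        "server": m.group("server"),
        "port": int(m.group("port")),
        # Store the folder as a tuple of levels so "." and "/" compare equal.
        "folder": tuple(m.group("folder").split(m.group("delim"))),
        "msgid": m.group("msgid"),
    }


def same_message(a, b):
    """True if both uris name the same message, ignoring the delimiter used."""
    pa, pb = parse_imap_uri(a), parse_imap_uri(b)
    return pa is not None and pa == pb
```

With this, a dotted uri and a slashed uri for the same server, folder and message id compare as equal.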

2. The second thing is the automatic re-indexing code. This would in theory be very straightforward and easy to implement using inotify, but it's not. Let's say the user fires up thunderbird to download and read new mails. inotify notices changes in some of the mork files and informs the thunderbird backend about this. This is where we run into trouble. When a mork file has been updated (mails removed or added), it _must_ be re-read and all mails assigned to that particular mork file have to be re-indexed. Why? This first needs some explanation of how thunderbird and mork work. The following text is very long, but it will hopefully clear up some question marks and it will also explain how the process works.

Let's take INBOX.msf for instance. This file contains lots of information about all mails in an inbox (this is a general imap case). This information includes subjects, senders, recipients, mail sizes, dates, message offsets etc. What's interesting here is the message size and offset values, because they are used when extracting mails later on. In order for the thunderbird backend to perform a full scan, all mails have to be downloaded to the hard drive. This is a setting the user has to activate. When this setting has been activated, thunderbird will download and store all mails in one big file with the same name as the mork file but without the extension (.msf). This is where we need the offsets and sizes. The offset tells us _where_ in this big file a particular mail begins and the size tells us where it ends. It may for instance begin at (offset) 0x100 and last for 0x90 (which is the size).
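
The offset and size lookup described above amounts to a seek followed by a bounded read. Here is a minimal Python sketch of that step; `extract_message` is a hypothetical name, and it assumes the offset and size have already been read from the mork file.

```python
def extract_message(store_path, offset, size):
    """Return the raw bytes of one mail from the big per-folder file.

    offset and size are the values recorded for the mail in the mork
    file (for example 0x100 and 0x90).
    """
    with open(store_path, "rb") as store:
        store.seek(offset)       # jump to where the mail begins
        return store.read(size)  # read exactly as many bytes as the mail has
```

With offset 0x100 and size 0x90, this reads bytes 0x100 through 0x18F of the file.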

Ok, now that you know the basics, let me explain the catch. When you download new mails, the big file containing all the mails grows and mails may change offsets and sizes (the same goes for the mork file). If we don't re-read the mork file right away, we will have incorrect offsets and sizes, which in turn leads to us extracting "random data". One mail that used to exist at 0x100 may now exist at 0x90, which means that we may end up extracting some parts of that mail and then continue into parts of the mails that follow it as well. This...well...is not good.
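
The effect of stale offsets can be shown with a toy model. The Python sketch below keeps the "big file" in memory as a byte string and the "mork file" as a dict; both helpers (`build_store`, `extract`) are hypothetical, and a real store will differ in layout. It only shows what happens when an old offset is applied to a rewritten file.

```python
def build_store(messages):
    """Concatenate mails; return (blob, index) with index[key] = (offset, size)."""
    blob = b""
    index = {}
    for key, body in messages:
        index[key] = (len(blob), len(body))
        blob += body
    return blob, index


def extract(blob, entry):
    """Cut one mail out of the blob using its (offset, size) entry."""
    offset, size = entry
    return blob[offset:offset + size]


old_blob, old_index = build_store([("a", b"AAAA"), ("b", b"BBBBBB"), ("c", b"CC")])

# Mail "a" is removed and the store is rewritten, so "b" and "c" move.
new_blob, new_index = build_store([("b", b"BBBBBB"), ("c", b"CC")])

stale = extract(new_blob, old_index["b"])  # outdated offset, current file
fresh = extract(new_blob, new_index["b"])  # offset re-read after the change
```

Here `stale` holds the tail of mail "b" followed by mail "c", while `fresh` holds mail "b" intact.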

In the code I've posted in the bugzilla I'm using per-account indexing, which in practice means that all mork files available for an account will be added to a list and one file is indexed at a time. When one mork file has been indexed, it will move on to the next one in the list and so on. This requires _one_ ThunderbirdIndexableGenerator per account. To work around the problem I've spent the last few minutes explaining, I've added some inotify code to ThunderbirdIndexableGenerator to automatically re-index files when changes are discovered. This design is bad because when that object becomes garbage (when all mork files in the list have been indexed), no updates will be handled and beagle will have to be restarted in order to get mail accounts re-indexed. I would like to move this code to ThunderbirdIndexer to solve this problem. But then a new problem arises: the ThunderbirdIndexableGenerator-object will not know of the update and cannot add the file to the list again for re-indexing, and ThunderbirdIndexer cannot create a new ThunderbirdIndexableGenerator that indexes only that file.

To make it more clear why we shouldn't do the last thing mentioned above, let's take INBOX.msf as an example again. When beagle is started, a ThunderbirdQueryable will be created that begins our indexing process. One ThunderbirdIndexableGenerator will eventually be created for each available thunderbird account. As long as they have mork files left in their lists, they will keep indexing mails. So, let's say INBOX.msf is the last file in that list and it will not be indexed for some time (not until all other files have been indexed at least). Bang! Inotify tells us INBOX.msf has been updated, which means that we want to re-index it. The running ThunderbirdIndexableGenerator-object will not do this since it has no inotify code (we assume the code has been moved to ThunderbirdIndexer). If we choose to create a new ThunderbirdIndexableGenerator-object now to re-index INBOX.msf, this will eventually mean that when the first ThunderbirdIndexableGenerator-object reaches INBOX.msf in its list, we will have TWO objects indexing the same file! Not good.
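
One possible way around the double-indexing problem, offered only as a sketch and not as what the posted code does, is a single list of pending files in which each path can appear at most once. Change notifications add to it and the indexing loop takes from it. The Python class `PendingFiles` below is hypothetical and ignores locking, which real code would need if notifications arrive on another thread.

```python
from collections import deque


class PendingFiles:
    """FIFO of mork files waiting to be indexed; each path appears at most once."""

    def __init__(self, paths=()):
        self._queue = deque()
        self._queued = set()
        for path in paths:
            self.add(path)

    def add(self, path):
        """Queue a path unless it is already waiting. Return True if queued."""
        if path in self._queued:
            return False
        self._queue.append(path)
        self._queued.add(path)
        return True

    def next(self):
        """Return the next path to index, or None when nothing is waiting."""
        if not self._queue:
            return None
        path = self._queue.popleft()
        self._queued.discard(path)
        return path
```

If a change to INBOX.msf is reported while it is still waiting, `add` returns False and the file is indexed once, in its current state, when its turn comes. If the change is reported after it has been indexed, `add` queues it again.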

I just want to add that the reason I chose per-account indexing instead of per-folder indexing (like the evolution backend works) is the expense of mork. I've written a version of this backend that utilizes one ThunderbirdIndexableGenerator per mork file as well (in case we want to index this way instead). This means one MorkDatabase-object per ThunderbirdIndexableGenerator-object, which in turn uses lots of ram, but most of all: it takes lots of time to open a mork file if it's big. Let's say a user has 5+ accounts with hundreds or thousands of mails (which for instance may be the case if you save all your mailing list mails); it will take forever to open all these mork files. beagle will create a long CPU peak when this is happening. A lot of ram will also be consumed (we already have this problem, no need to make it even more obvious). Per-account indexing only needs one mork file open at any time, which is a big bonus compared to per-folder. Just figured this needed some explanation. I need comments on all of this.

3. The third thing is thunderbird. Some versions of thunderbird seem to have a bug that prevents thunderbird from opening a new mail from beagle-search (or the command-line in general) if thunderbird is already running. One fix would be to pass "-mail" to thunderbird, which will open a new thunderbird window and display the mail there. Better than using lots of version hacks IMO.

For the search-ui launch program issue, let's just have a
configure/compile time check for Thunderbird's version and work off
that.

Wouldn't it be better to re-write some parts of beagle-search to make it look for commands in an XML-file? Much better, and easier, IMHO.

I don't wanna be reinventing the wheel, so I'll wait to hear what's
being worked on and what needs work.

I've mentioned some of the things that need work above. Next after those three things are testing, testing, more testing and bug fixing. Currently my biggest issue is time. I'm graduating in exactly two months and school is taking all my free time. This means that I will need as much help as I can get in getting this tested and bug fixed until it hopefully finds its way into HEAD.


Thanks!

Pierre

References:
- Re: [Bug 323065] Thunderbird Backend
  - From: Michal Kolodziejczyk
- Re: [Bug 323065] Thunderbird Backend
  - From: Pierre Östlund
- Re: [Bug 323065] Thunderbird Backend
  - From: Michal Kolodziejczyk
- Re: [Bug 323065] Thunderbird Backend
  - From: Kevin Kubasik
