Re: [Bug 323065] Thunderbird Backend



Kevin Kubasik wrote:
> Ok, I've been playing with Pierre's code and it's quite similar to what
> I did (which was mostly copy and paste Evolution stuff). Where does
> the TODO stand at the moment before this backend's inclusion into CVS?
> It seems that most of Pierre's concerns are WONTFIX's for one reason
> or another. Is it worth getting this out into the 'public' a bit for
> some more feedback? If not, what exactly do we need to accomplish?
> (I'd love to help however I can).

There are basically three things that need a lot of attention:
 1. The IMAP URI
 2. Automatic re-indexing of files when needed
 3. Thunderbird

1. The IMAP URI seems to differ depending on which version of Thunderbird is being used, or on some IMAP server setting. I don't know which yet. My URI looks something like this:

imap://<server>:<port>/fetch>UID>.INBOX><message id>

But Michal's URI looks like this:

imap://<server>:<port>/fetch>UID>/INBOX><message id>

Why does his URI have slashes while mine has dots? Does anyone know anything about this?
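
Until the cause is known, one way to cope with both variants is to treat the first character after "fetch>UID>" as the folder separator and compare URIs on the parsed parts. The following is only a minimal sketch in Python (not Beagle code); the names parse_imap_uri and ImapMessageUri are made up for illustration, and the idea that the dot or slash is a folder separator is an assumption, not something confirmed in this thread.

```python
import re
from typing import NamedTuple, Tuple


class ImapMessageUri(NamedTuple):
    """Parsed form of imap://<server>:<port>/fetch>UID><sep>FOLDER><message id>."""
    server: str
    port: int
    folder: Tuple[str, ...]  # folder path split on the separator
    message_id: str


# Groups: 1 = server, 2 = port, 3 = separator ('.' or '/'),
#         4 = folder path, 5 = message id.
_URI_RE = re.compile(r"imap://([^:/]+):(\d+)/fetch>UID>([./])(.*)>([^>]+)")


def parse_imap_uri(uri: str) -> ImapMessageUri:
    """Split a URI of either variant into separator-independent parts."""
    match = _URI_RE.fullmatch(uri)
    if match is None:
        raise ValueError("unrecognized IMAP message URI: %r" % uri)
    server, port, separator, folder, message_id = match.groups()
    # Assumption: the character right after 'fetch>UID>' is the separator
    # used between folder names for the rest of the path.
    return ImapMessageUri(server, int(port), tuple(folder.split(separator)), message_id)
```

With this, "imap://mail.example.org:143/fetch>UID>.INBOX>42" and "imap://mail.example.org:143/fetch>UID>/INBOX>42" parse to equal values, so the two forms can be matched against each other.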

2. The second thing is the automatic re-indexing code. In theory this would be straightforward to implement using inotify, but it's not. Let's say the user fires up Thunderbird to download and read new mail. inotify reports changes in some of the mork files and informs the Thunderbird backend. This is where we run into trouble. When a mork file has been updated (mails removed or added), it _must_ be re-read, and all mails assigned to that particular mork file have to be re-indexed. Why? That first requires some explanation of how Thunderbird and mork work. The following text is very long, but it will hopefully clear up some question marks and explain how the process works.

Let's take INBOX.msf as an example. This file contains lots of information about all the mails in an inbox (this is the general IMAP case). This information includes subjects, senders, recipients, mail sizes, dates, message offsets etc. What's interesting here is the message size and offset values, because they are used when extracting mails later on. In order for the Thunderbird backend to perform a full scan, all mails have to be downloaded to the hard drive. This is a setting the user has to activate. When it has been activated, Thunderbird will download and store all mails in one big file with the same name as the mork file but without the extension (.msf). This is where we need the offsets and sizes. The offset tells us _where_ in this big file a particular mail begins, and the size tells us where it ends. It may for instance begin at offset 0x100 and last for 0x90 (which is the size).
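
The extraction step described above can be sketched in a few lines. This is a hypothetical Python illustration, not the backend's actual code: read_message is an invented name, and it assumes the offset and size are counted in bytes from the start of the big mail file.

```python
def read_message(mailbox_path: str, offset: int, size: int) -> bytes:
    """Return the raw mail stored at [offset, offset + size) in the big mail file.

    The offset and size are the values read from the mork (.msf) file.
    Assumption: both are byte counts.
    """
    with open(mailbox_path, "rb") as mailbox:
        mailbox.seek(offset)
        data = mailbox.read(size)
    if len(data) != size:
        # The file is shorter than the mork data claims, so the offsets
        # and sizes we hold no longer describe this file.
        raise ValueError("short read at offset %#x in %s" % (offset, mailbox_path))
    return data
```

For the example above, the call would be read_message("INBOX", 0x100, 0x90), where "INBOX" stands for the path of the big file next to INBOX.msf.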

OK, now that you know the basics, let me explain the catch. When you download new mails, the big file containing all the mails grows, and mails may change offsets and sizes (the same goes for the mork file). If we don't re-read the mork file right away, we will have incorrect offsets and sizes, which in turn leads to us extracting "random data". A mail that used to exist at 0x100 may now exist at 0x90, which means that we may end up extracting some parts of that mail and then continue extracting parts of the mail that follows it as well. This...well...is not good.
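
To make the failure concrete, here is a small self-contained simulation in Python. It does not touch real Thunderbird files: the mailbox is an in-memory byte string, the three mails are invented, and build_index merely stands in for the offsets and sizes that would come from the mork file. It shows one of the cases mentioned above (a mail removed): reading with the old offsets returns the tail of one mail glued to the head of the next.

```python
from typing import List, Tuple

# Three made-up mails; real mailbox contents will look different.
MAILS = [
    b"From - one\n\nfirst\n",
    b"From - two\n\nsecond mail\n",
    b"From - three\n\nthird\n",
]


def build_index(messages: List[bytes]) -> List[Tuple[int, int]]:
    """Stand-in for the mork data: one (offset, size) pair per mail."""
    index = []
    offset = 0
    for message in messages:
        index.append((offset, len(message)))
        offset += len(message)
    return index


def extract(mailbox: bytes, entry: Tuple[int, int]) -> bytes:
    """Cut one mail out of the mailbox using an (offset, size) pair."""
    offset, size = entry
    return mailbox[offset:offset + size]


def demo() -> Tuple[bytes, bytes]:
    """Return (mail read with stale offsets, mail read with fresh offsets)."""
    old_index = build_index(MAILS)       # index read before the change
    remaining = MAILS[1:]                # the first mail has been removed
    mailbox = b"".join(remaining)        # the big file as it looks now
    stale = extract(mailbox, old_index[1])
    fresh = extract(mailbox, build_index(remaining)[0])
    return stale, fresh
```

Here the fresh read gives back the second mail intact, while the stale read starts in the middle of it and runs on into the third one.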

In the code I've posted to Bugzilla I'm using per-account indexing, which in practice means that all mork files available for an account are added to a list and one file is indexed at a time. When one mork file has been indexed, it moves on to the next one in the list, and so on. This requires _one_ ThunderbirdIndexableGenerator per account. To work around the problem I've just spent the last few minutes explaining, I've added some inotify code to ThunderbirdIndexableGenerator to automatically re-index files when changes are discovered. This design is bad because when that object becomes garbage (when all mork files in the list have been indexed), no updates will be handled and Beagle will have to be restarted in order to get mail accounts re-indexed. I would like to move this code to ThunderbirdIndexer to solve this problem. But then a new problem arises: the ThunderbirdIndexableGenerator object will not know of the update and cannot add the file to its list again for re-indexing, and ThunderbirdIndexer cannot create a new ThunderbirdIndexableGenerator that indexes only that file.

To make it clearer why we shouldn't do the last thing mentioned above, let's take INBOX.msf as an example again. When Beagle is started, a ThunderbirdQueryable is created that begins our indexing process. One ThunderbirdIndexableGenerator will eventually be created for each available Thunderbird account. As long as they have mork files left in their lists, they will keep indexing mails. So, let's say INBOX.msf is the last file in that list and it will not be indexed for some time (not until all other files have been indexed, at least). Bang! inotify tells us INBOX.msf has been updated, which means that we want to re-index it. The running ThunderbirdIndexableGenerator object will not do this since it has no inotify code (we assume the code has been moved to ThunderbirdIndexer). If we choose to create a new ThunderbirdIndexableGenerator object now to re-index INBOX.msf, then when the first ThunderbirdIndexableGenerator object reaches INBOX.msf in its list, we will have TWO objects indexing the same file! Not good.
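
One possible way around the "two objects, one file" problem is a single shared work list that refuses duplicates: the inotify handler only marks a file as dirty, and whoever does the indexing pulls the next file from the list. The sketch below is Python rather than Beagle's C#, and MorkWorkQueue, mark_dirty and next_file are hypothetical names, not existing Beagle classes or methods. It is meant to show the bookkeeping only, not a finished design.

```python
from collections import deque
from typing import Iterable, Optional


class MorkWorkQueue:
    """Per-account list of mork files waiting to be indexed, without duplicates."""

    def __init__(self, paths: Iterable[str] = ()) -> None:
        self._queue = deque()   # files in the order they will be indexed
        self._pending = set()   # the same files, for fast membership tests
        for path in paths:
            self.mark_dirty(path)

    def mark_dirty(self, path: str) -> bool:
        """Called on a change notification. Returns True if the file was queued.

        A file that is already waiting is not queued twice: it will be
        re-read from disk when its turn comes, so it picks up the change.
        """
        if path in self._pending:
            return False
        self._pending.add(path)
        self._queue.append(path)
        return True

    def next_file(self) -> Optional[str]:
        """Hand out the next file to index, or None if nothing is waiting."""
        if not self._queue:
            return None
        path = self._queue.popleft()
        # Once handed out, the file is no longer pending, so a change that
        # arrives while it is being indexed queues it again.
        self._pending.discard(path)
        return path

    def __len__(self) -> int:
        return len(self._queue)
```

In the INBOX.msf scenario above, a change notification that arrives while INBOX.msf is still waiting in the list is simply ignored, and one that arrives after the list has been emptied puts the file back, so no restart is needed. This only works if the list outlives the object that does the indexing.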

I just want to add that the reason I chose per-account indexing instead of per-folder indexing (which is how the Evolution backend works) is the expense of mork. I've also written a version of this backend that uses one ThunderbirdIndexableGenerator per mork file (in case we want to index this way instead). This means one MorkDatabase object per ThunderbirdIndexableGenerator object, which in turn uses lots of RAM, but most of all: opening a mork file takes a lot of time if it's big. Let's say a user has 5+ accounts with hundreds or thousands of mails (which may be the case if you save all your mailing list mails, for instance); it will take forever to open all these mork files. Beagle will cause a long CPU peak while this is happening. A lot of RAM will also be consumed (we already have this problem, no need to make it even more obvious). Per-account indexing only needs one mork file open at any time, which is a big bonus compared to per-folder. I just figured this needed some explanation. I need comments on all of this.

3. The third thing is Thunderbird. Some versions of Thunderbird seem to have a bug that prevents Thunderbird from opening a new mail from beagle-search (or from the command line in general) if Thunderbird is already running. This could be worked around by passing "-mail" to Thunderbird, which will open a new Thunderbird window and display the mail there. Better than using lots of version hacks, IMO.

> For the search-ui launch program issue, lets just have a
> configure/compile time check for Thunderbird's version and work off
> that.

Wouldn't it be better to rewrite some parts of beagle-search to make it look up commands in an XML file? Much better, and easier, IMHO.

> I don't wanna be reinventing the wheel, so I'll wait to hear what's
> being worked on and what needs work.

I've mentioned some of the things that need work above. After those three things come testing, testing, more testing and bug fixing. Currently my biggest issue is time. I'm graduating in exactly two months and school is taking all my free time. This means that I will need as much help as I can get with testing and bug fixing until this hopefully finds its way into HEAD.

Thanks!

Pierre


