Re: [Bug 323065] Thunderbird Backend
- From: Pierre Östlund <pierre ostlund gmail com>
- To: Kevin Kubasik <kevin kubasik net>
- Cc: Dashboard-hackers gnome org
- Subject: Re: [Bug 323065] Thunderbird Backend
- Date: Sun, 16 Apr 2006 19:49:28 +0200
Kevin Kubasik wrote:
> Ok, I've been playing with Pierre's code and it's quite similar to what
> I did (which was mostly copying and pasting Evolution stuff). Where does
> the TODO stand at the moment, before this backend's inclusion into CVS?
> It seems that most of Pierre's concerns are WONTFIXes for one reason
> or another. Is it worth getting this out into the 'public' a bit for
> some more feedback? If not, what exactly do we need to accomplish?
> (I'd love to help however I can.)
There are basically three things that need much attention:
1. The imap uri
2. Automatic re-indexing of files when needed
3. Thunderbird
1. The imap uri seems to differ depending on which version of
thunderbird is being used, or on some imap server setting. I don't
know which yet. My uri looks something like this:
imap://<server>:<port>/fetch>UID>.INBOX><message id>
But Michal's uri looks like this:
imap://<server>:<port>/fetch>UID>/INBOX><message id>
Why does his uri have slashes while mine has dots? Does anyone know
anything about this?
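A possible explanation, offered here only as an assumption and not
verified against the thunderbird source: the character right after
"UID>" may be the folder hierarchy delimiter reported by the imap
server, which is "." on some servers and "/" on others. If that is the
case, the backend could treat that character as the delimiter and
compare uris by their folder components instead of by raw string. The
sketch below is in Python for brevity; parse_imap_uri and same_message
are hypothetical names, not part of beagle.

```python
from typing import NamedTuple, Tuple


class ImapMessageUri(NamedTuple):
    server: str              # "<server>:<port>", kept verbatim
    delimiter: str           # assumed hierarchy delimiter, e.g. "." or "/"
    folder: Tuple[str, ...]  # folder path components, e.g. ("INBOX",)
    message_id: str          # trailing message id


def parse_imap_uri(uri: str) -> ImapMessageUri:
    """Parse imap://<server>:<port>/fetch>UID><delim><folder>><message id>."""
    prefix = "imap://"
    if not uri.startswith(prefix):
        raise ValueError("not an imap uri: %r" % uri)
    # Split the server part from the fetch part at the first slash.
    server, sep, rest = uri[len(prefix):].partition("/")
    if not sep:
        raise ValueError("missing fetch part: %r" % uri)
    parts = rest.split(">")
    # Expected layout: ["fetch", "UID", "<delim><folder>", "<message id>"]
    if len(parts) != 4 or parts[0] != "fetch" or parts[1] != "UID":
        raise ValueError("unexpected uri layout: %r" % uri)
    folder_part, message_id = parts[2], parts[3]
    if not folder_part:
        raise ValueError("empty folder part: %r" % uri)
    # Assumption: the first character of the folder part is the delimiter.
    delimiter = folder_part[0]
    folder = tuple(folder_part[1:].split(delimiter))
    return ImapMessageUri(server, delimiter, folder, message_id)


def same_message(a: ImapMessageUri, b: ImapMessageUri) -> bool:
    """Compare two parsed uris while ignoring the delimiter character."""
    return (a.server, a.folder, a.message_id) == (b.server, b.folder, b.message_id)
```

With this, the dotted and the slashed form of the same uri parse to the
same folder components, so they can be matched against each other.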
2. The second thing is the automatic re-indexing code. In theory this
would be very straightforward to implement using inotify, but it's not.
Let's say the user fires up thunderbird to download and read new mails.
inotify notices changes in some of the mork files and informs the
thunderbird backend about this. This is where we run into trouble. When
a mork file has been updated (mails removed or added), it _must_ be
re-read and all mails assigned to that particular mork file have to be
re-indexed. Why? This first needs some explanation of how thunderbird
and mork work. The following text is very long, but it will hopefully
clear up some question marks and explain how the process works.
Let's take INBOX.msf for instance. This file contains lots of
information about all mails in an inbox (this is the general imap case).
This information includes subjects, senders, recipients, mail sizes,
dates, message offsets etc. What's interesting here is the message
size and offset values, because they are used when extracting mails
later on. In order for the thunderbird backend to perform a full scan,
all mails have to be downloaded to the hard drive. This is a setting
the user has to activate. Once it has been activated, thunderbird will
download and store all mails in one big file with the same name as the
mork file but without the extension (.msf). This is where we need the
offsets and sizes. The offset tells us _where_ in this big file a
particular mail begins and the size tells us where it ends. A mail may
for instance begin at offset 0x100 and last for 0x90 bytes (its size).
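As a minimal sketch of the extraction step described above (in Python
for brevity; extract_mail is a hypothetical name and the offset and
size are assumed to have been read from the mork file already):

```python
def extract_mail(mail_file_path: str, offset: int, size: int) -> bytes:
    """Return the raw bytes of one mail stored in the big mail file.

    offset and size are the values recorded for the mail in the mork
    (.msf) file. A short read means they no longer match the file.
    """
    with open(mail_file_path, "rb") as mail_file:
        mail_file.seek(offset)        # jump to where the mail begins
        data = mail_file.read(size)   # read exactly the recorded size
    if len(data) != size:
        raise ValueError("offset/size do not fit the file; re-read the mork file")
    return data
```

Note that the length check only catches offsets that run past the end
of the file. Stale values that still fall inside the file return the
wrong bytes silently, which is why the mork file has to be re-read after
every change.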
Ok, now that you know the basics, let me explain the catch. When you
download new mails, the big file containing all the mails grows and
mails may change offsets and sizes (the same goes for the mork file). If
we don't re-read the mork file right away, we will have incorrect
offsets and sizes, which in turn leads to us extracting "random data". A
mail that used to exist at 0x100 may now exist at 0x90, which means that
we may end up extracting only part of that mail and then continue into
part of the mail that follows it. This...well...is not good.
In the code I've posted in bugzilla I'm using per-account indexing,
which in practice means that all mork files available for an account
are added to a list and one file is indexed at a time. When one mork
file has been indexed, it moves on to the next one in the list and so
on. This requires _one_ ThunderbirdIndexableGenerator per account. To
work around the problem I've spent the last few minutes explaining, I've
added some inotify code to ThunderbirdIndexableGenerator to
automatically re-index files when changes are discovered. This design is
bad because when that object becomes garbage (when all mork files in the
list have been indexed), no updates will be handled and beagle will have
to be restarted in order to get mail accounts re-indexed. I would like
to move this code to ThunderbirdIndexer to solve this problem. But then
a new problem arises: the ThunderbirdIndexableGenerator object will not
know of the update and cannot add the file to the list again for
re-indexing, and ThunderbirdIndexer cannot simply create a new
ThunderbirdIndexableGenerator that indexes only that file.
To make it clearer why we shouldn't do the last thing mentioned above,
let's take INBOX.msf as an example again. When beagle is started, a
ThunderbirdQueryable is created that begins our indexing process. One
ThunderbirdIndexableGenerator will eventually be created for each
available thunderbird account. As long as they have mork files left in
their lists, they will keep indexing mails. So, let's say INBOX.msf is
the last file in that list and will not be indexed for some time (not
until all other files have been indexed, at least). Bang! Inotify tells
us INBOX.msf has been updated, which means that we want to re-index it.
The running ThunderbirdIndexableGenerator object will not do this since
it has no inotify code (we assume the code has been moved to
ThunderbirdIndexer). If we choose to create a new
ThunderbirdIndexableGenerator object now to re-index INBOX.msf, then
when the first ThunderbirdIndexableGenerator object reaches INBOX.msf in
its list, we will have TWO objects indexing the same file! Not good.
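One way the double-indexing problem could be avoided, sketched here
purely as an illustration and not as what the posted code does, is to
let the inotify handler and the generator share a single pending list
per account. A file that is still waiting is never queued twice, and a
file that has already been indexed is simply queued again. The sketch
is in Python for brevity; ReindexQueue and its methods are hypothetical
names.

```python
import threading
from collections import OrderedDict


class ReindexQueue:
    """Pending mork files for one account, without duplicates."""

    def __init__(self, mork_files):
        self._lock = threading.Lock()
        # Keys are file paths; an ordered dict keeps first-in-first-out
        # order and guarantees each path appears at most once.
        self._pending = OrderedDict((path, None) for path in mork_files)

    def notify_changed(self, path):
        """Called by the (hypothetical) inotify handler for a changed file."""
        with self._lock:
            # No-op if the file is still waiting, otherwise re-queue it.
            self._pending.setdefault(path, None)

    def next_file(self):
        """Called by the indexing side; returns None when nothing is pending."""
        with self._lock:
            if not self._pending:
                return None
            path, _ = self._pending.popitem(last=False)
            return path
```

A change to INBOX.msf while it is still waiting in the list leaves the
list untouched, and a change after it has been indexed puts it back at
the end. Whatever owns the queue would have to outlive the generator so
that later updates are not lost.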
I just want to add that the reason I chose per-account indexing
instead of per-folder indexing (the way the evolution backend works) is
the expense of mork. I've written a version of this backend that uses
one ThunderbirdIndexableGenerator per mork file as well (in case we want
to index this way instead). This means one MorkDatabase object per
ThunderbirdIndexableGenerator object, which in turn uses lots of ram,
but most of all: opening a mork file takes a lot of time if it's big.
Let's say a user has 5+ accounts with hundreds or thousands of mails
(which may be the case if you save all your mailing list mails); it
will take forever to open all these mork files. beagle will cause a long
CPU peak while this is happening. A lot of ram will also be consumed (we
already have this problem, no need to make it even more obvious).
Per-account indexing only needs one mork file open at any time, which is
a big bonus compared to per-folder. Just figured this needed some
explanation. I need comments on all of this.
3. The third thing is thunderbird. Some versions of thunderbird seem to
have a bug that prevents thunderbird from opening a new mail from
beagle-search (or from the command line in general) if thunderbird is
already running. A fix for this could be to pass "-mail" to thunderbird,
which will open a new thunderbird window and display the mail there.
Better than using lots of version hacks IMO.
> For the search-ui launch program issue, let's just have a
> configure/compile time check for Thunderbird's version and work off
> that.
Wouldn't it be better to rewrite some parts of beagle-search to make it
look up commands in an XML file? Much better, and easier, IMHO.
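To illustrate the idea, here is a rough sketch of such a lookup, in
Python for brevity. Everything about the XML layout (the launchers,
launcher, command and arg elements, the client attribute and the {uri}
placeholder) is made up for this example and is not an existing beagle
format. Only the "-mail" argument comes from the discussion above.

```python
import xml.etree.ElementTree as ET

# Hypothetical launcher description; the schema is an assumption.
LAUNCHERS_XML = """
<launchers>
  <launcher client="thunderbird">
    <command>thunderbird</command>
    <arg>-mail</arg>
    <arg>{uri}</arg>
  </launcher>
</launchers>
"""


def build_command(xml_text, client, uri):
    """Return the argument list for the given client, or None if unknown."""
    root = ET.fromstring(xml_text)
    for launcher in root.findall("launcher"):
        if launcher.get("client") != client:
            continue
        command = launcher.findtext("command")
        # Substitute the uri placeholder in every argument.
        args = [(arg.text or "").replace("{uri}", uri)
                for arg in launcher.findall("arg")]
        return [command] + args
    return None
```

Changing how a mail client is launched would then only mean editing the
XML file, without a version check at configure or compile time.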
> I don't wanna be reinventing the wheel, so I'll wait to hear what's
> being worked on and what needs work.
I've mentioned some of the things that need work above. Next after
those three things come testing, testing, more testing and bug fixing.
Currently my biggest issue is time. I'm graduating in exactly two months
and school is taking all my free time. This means that I will need as
much help as I can get with testing and bug fixing until this hopefully
finds its way into HEAD.
Thanks!
Pierre