Re: Summary store, impl n. 7

From: Dave Cridland <dave cridland net>
To: Philip Van Hoof <spam pvanhoof be>
Cc: Jeffrey Stedfast <fejj novell com>, tinymail-devel-list gnome org
Subject: Re: Summary store, impl n. 7
Date: Tue, 01 Jan 2008 20:31:17 +0000

On Thu Dec 27 19:53:03 2007, Philip Van Hoof wrote:

Hi there,

This is my latest experimental summary store code.
o. Makes strings unique in the mmap, which reduces VmSize. This islessimportant other than that uniqueness of strings also reducesVmRss a
   little bit. But more important is ...
o. Sorts strings on usage in the mmap, which reduces VmRss (lesspages
   needed) (this is what Tinymail's mmap Camel patch doesn't do)
o. Stores flags in a separate file. Flags are the only type of datathatchanges relatively often in the summary. Therefore a separatefile.

Hmmm... FLAGS and MODSEQ. I'm not clear if you really need MODSEQ tobe stored at all in the client, mind.

o. Stores "ftell()" offsets in a simplistic index file. Parsing this

file can be improved (I'm using fscanf, which might not evenwork if

   the data file grows to terrabytes -- but let's be honest and just
   accept that this is very unlikely for a summary's datafile to
   happen --).

Surely the offsets are fixed width, so a mmap() candidate?

o. Has a way to expunge an item (and the sequence numbers of higher
   items are adapted automatically)
The index nor mmap are already written when you do this. This isa(relatively trivial) TODO item (simply rewrite the SummaryBlockthatgot changed in the expunge methods, but I'm still making up mymind
   about this rewriting) (#z)
You can expunge items by sequence number and by uid. These arevitalfor the new VANISHED in IMAP with QRESYNC and EXPUNGE in normalIMAP

And unsolicited FETCH FLAGS. Possibly others, too.

It's not yet possible to give uidsets (this is a TODO). With auidsetyou might indeed achieve more efficiently getting rid of a rangeof
   items.

o. Has a way to get an item by sequence number quickly (hashtable
   lookup)

Polymer uses two caches for this - one which is by UID, but containsno msgno. The other is essentially a sparse list of UIDs.

Finding a sequence number from a UID is done by successiveapproximation - it assumes that the graph of msgno vs UIDapproximates a straight line, which astonishingly it does. It does alittle more cleverness by quickly checking the currently loadedblocks, too, but that's the bulk of it.

The effect this'd have would be to (usually) map the first and lastblocks, and one other (the one the user was "looking" at). Quiteoften, all three are the same.

So for tiny mailboxes - one or two blocks - it's a straightforwardsuccessive approximation search through the mmapped block containingthe UID. (And that's easy to find).

For medium mailboxes, there's the rather nice property that, ingeneral, users get rid of mail fairly evenly throughout. (There's agentle upward curve, actually, as spam increases).

For massive mailboxes, these turn out to be mailing list archives,etc, where no expunges have happened at all, and the successiveapproximation technique finds the UID with one probe.

o. Has a way to create new items easily. Deals with out-of-orderaddingof items (when requesting an item with a sequence number thatdoesn't
   yet exist, you'll simply get NULL)

You probably want to be returning things other than NULL. You've got,at least:


- "This data cannot exist" - ie, the UID or msgno doesn't exist.

- "This data doesn't exist" - ie, the UID or msgno exists, but themessage does not have this data.- "I don't have this data" - ie, the UID or msgno exists, but thedata hasn't been requested yet.

o. Scales to > 200,000 items easily

Cool. I'll be able to read my mail on Modest. It managed tosynchronize my work account yesterday, after several weeks.

o. Should be relatively thread safe (at least some attempts atmaking itthat way are already in place). (although I'd love to uselock-free
   hashtables, I simply used mutexes for this)

I wouldn't get stressed about this. Instead, I'd suggest you get ridof whatever inane threading is going on, and/or supply a wrapper APIwhich marshals requests to the single thread handling the cache andprotocol.

o. The hashtable keys of the sorter doesn't get fsck-ed up anymore,like
   in mytest6.tar.gz :-)

Future:

o. Indexed on blocks of 1000 items: I still need to make the per
   sequence getter calculate the block where it can find the item
o. Writing out per block, in stead of one large mmapped file. Allmagicis already in place for this, I just need to make a smallalgorithm
   to define the filename for a sequence number.
o. A freeze/thaw API for Dave so that he can make multiple changes.This
   integrates with (#z) of course.

Yes... I find it improves performance hugely when the user highlightsall messages and hits "delete".


Dave.
--
Dave Cridland - mailto:dave cridland net - xmpp:dwd jabber org
 - acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
 - http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade

Follow-Ups:
- Re: Summary store, impl n. 7
  - From: Philip Van Hoof

[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]