Re: Summary store, impl n. 7



On Thu Dec 27 19:53:03 2007, Philip Van Hoof wrote:
Hi there,

This is my latest experimental summary store code.

o. Makes strings unique in the mmap, which reduces VmSize. This is less important other than that uniqueness of strings also reduces VmRss a
   little bit. But more important is ...

o. Sorts strings on usage in the mmap, which reduces VmRss (less pages
   needed) (this is what Tinymail's mmap Camel patch doesn't do)

o. Stores flags in a separate file. Flags are the only type of data that changes relatively often in the summary. Therefore a separate file.


Hmmm... FLAGS and MODSEQ. I'm not clear if you really need MODSEQ to be stored at all in the client, mind.


o. Stores "ftell()" offsets in a simplistic index file. Parsing this
file can be improved (I'm using fscanf, which might not even work if
   the data file grows to terrabytes -- but let's be honest and just
   accept that this is very unlikely for a summary's datafile to
   happen --).


Surely the offsets are fixed width, so a mmap() candidate?


o. Has a way to expunge an item (and the sequence numbers of higher
   items are adapted automatically)

The index nor mmap are already written when you do this. This is a (relatively trivial) TODO item (simply rewrite the SummaryBlock that got changed in the expunge methods, but I'm still making up my mind
   about this rewriting) (#z)

You can expunge items by sequence number and by uid. These are vital for the new VANISHED in IMAP with QRESYNC and EXPUNGE in normal IMAP


And unsolicited FETCH FLAGS. Possibly others, too.


It's not yet possible to give uidsets (this is a TODO). With a uidset you might indeed achieve more efficiently getting rid of a range of
   items.

o. Has a way to get an item by sequence number quickly (hashtable
   lookup)


Polymer uses two caches for this - one which is by UID, but contains no msgno. The other is essentially a sparse list of UIDs.

Finding a sequence number from a UID is done by successive approximation - it assumes that the graph of msgno vs UID approximates a straight line, which astonishingly it does. It does a little more cleverness by quickly checking the currently loaded blocks, too, but that's the bulk of it.

The effect this'd have would be to (usually) map the first and last blocks, and one other (the one the user was "looking" at). Quite often, all three are the same.

So for tiny mailboxes - one or two blocks - it's a straightforward successive approximation search through the mmapped block containing the UID. (And that's easy to find).

For medium mailboxes, there's the rather nice property that, in general, users get rid of mail fairly evenly throughout. (There's a gentle upward curve, actually, as spam increases).

For massive mailboxes, these turn out to be mailing list archives, etc, where no expunges have happened at all, and the successive approximation technique finds the UID with one probe.


o. Has a way to create new items easily. Deals with out-of-order adding of items (when requesting an item with a sequence number that doesn't
   yet exist, you'll simply get NULL)


You probably want to be returning things other than NULL. You've got, at least:

- "This data cannot exist" - ie, the UID or msgno doesn't exist.
- "This data doesn't exist" - ie, the UID or msgno exists, but the message does not have this data. - "I don't have this data" - ie, the UID or msgno exists, but the data hasn't been requested yet.


o. Scales to > 200,000 items easily


Cool. I'll be able to read my mail on Modest. It managed to synchronize my work account yesterday, after several weeks.


o. Should be relatively thread safe (at least some attempts at making it that way are already in place). (although I'd love to use lock-free
   hashtables, I simply used mutexes for this)


I wouldn't get stressed about this. Instead, I'd suggest you get rid of whatever inane threading is going on, and/or supply a wrapper API which marshals requests to the single thread handling the cache and protocol.


o. The hashtable keys of the sorter doesn't get fsck-ed up anymore, like
   in mytest6.tar.gz :-)

Future:

o. Indexed on blocks of 1000 items: I still need to make the per
   sequence getter calculate the block where it can find the item

o. Writing out per block, in stead of one large mmapped file. All magic is already in place for this, I just need to make a small algorithm
   to define the filename for a sequence number.

o. A freeze/thaw API for Dave so that he can make multiple changes. This
   integrates with (#z) of course.

Yes... I find it improves performance hugely when the user highlights all messages and hits "delete".

Dave.
--
Dave Cridland - mailto:dave cridland net - xmpp:dwd jabber org
 - acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
 - http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade


[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]