Re: Summary store, impl n. 7
- From: Dave Cridland <dave cridland net>
- To: Philip Van Hoof <spam pvanhoof be>
- Cc: Jeffrey Stedfast <fejj novell com>, tinymail-devel-list gnome org
- Subject: Re: Summary store, impl n. 7
- Date: Tue, 01 Jan 2008 20:31:17 +0000
On Thu Dec 27 19:53:03 2007, Philip Van Hoof wrote:
Hi there,
This is my latest experimental summary store code.
o. Makes strings unique in the mmap, which reduces VmSize. This is
less
important other than that uniqueness of strings also reduces
VmRss a
little bit. But more important is ...
o. Sorts strings on usage in the mmap, which reduces VmRss (less
pages
needed) (this is what Tinymail's mmap Camel patch doesn't do)
o. Stores flags in a separate file. Flags are the only type of data
that
changes relatively often in the summary. Therefore a separate
file.
Hmmm... FLAGS and MODSEQ. I'm not clear if you really need MODSEQ to
be stored at all in the client, mind.
o. Stores "ftell()" offsets in a simplistic index file. Parsing this
file can be improved (I'm using fscanf, which might not even
work if
the data file grows to terrabytes -- but let's be honest and just
accept that this is very unlikely for a summary's datafile to
happen --).
Surely the offsets are fixed width, so a mmap() candidate?
o. Has a way to expunge an item (and the sequence numbers of higher
items are adapted automatically)
The index nor mmap are already written when you do this. This is
a
(relatively trivial) TODO item (simply rewrite the SummaryBlock
that
got changed in the expunge methods, but I'm still making up my
mind
about this rewriting) (#z)
You can expunge items by sequence number and by uid. These are
vital
for the new VANISHED in IMAP with QRESYNC and EXPUNGE in normal
IMAP
And unsolicited FETCH FLAGS. Possibly others, too.
It's not yet possible to give uidsets (this is a TODO). With a
uidset
you might indeed achieve more efficiently getting rid of a range
of
items.
o. Has a way to get an item by sequence number quickly (hashtable
lookup)
Polymer uses two caches for this - one which is by UID, but contains
no msgno. The other is essentially a sparse list of UIDs.
Finding a sequence number from a UID is done by successive
approximation - it assumes that the graph of msgno vs UID
approximates a straight line, which astonishingly it does. It does a
little more cleverness by quickly checking the currently loaded
blocks, too, but that's the bulk of it.
The effect this'd have would be to (usually) map the first and last
blocks, and one other (the one the user was "looking" at). Quite
often, all three are the same.
So for tiny mailboxes - one or two blocks - it's a straightforward
successive approximation search through the mmapped block containing
the UID. (And that's easy to find).
For medium mailboxes, there's the rather nice property that, in
general, users get rid of mail fairly evenly throughout. (There's a
gentle upward curve, actually, as spam increases).
For massive mailboxes, these turn out to be mailing list archives,
etc, where no expunges have happened at all, and the successive
approximation technique finds the UID with one probe.
o. Has a way to create new items easily. Deals with out-of-order
adding
of items (when requesting an item with a sequence number that
doesn't
yet exist, you'll simply get NULL)
You probably want to be returning things other than NULL. You've got,
at least:
- "This data cannot exist" - ie, the UID or msgno doesn't exist.
- "This data doesn't exist" - ie, the UID or msgno exists, but the
message does not have this data.
- "I don't have this data" - ie, the UID or msgno exists, but the
data hasn't been requested yet.
o. Scales to > 200,000 items easily
Cool. I'll be able to read my mail on Modest. It managed to
synchronize my work account yesterday, after several weeks.
o. Should be relatively thread safe (at least some attempts at
making it
that way are already in place). (although I'd love to use
lock-free
hashtables, I simply used mutexes for this)
I wouldn't get stressed about this. Instead, I'd suggest you get rid
of whatever inane threading is going on, and/or supply a wrapper API
which marshals requests to the single thread handling the cache and
protocol.
o. The hashtable keys of the sorter doesn't get fsck-ed up anymore,
like
in mytest6.tar.gz :-)
Future:
o. Indexed on blocks of 1000 items: I still need to make the per
sequence getter calculate the block where it can find the item
o. Writing out per block, in stead of one large mmapped file. All
magic
is already in place for this, I just need to make a small
algorithm
to define the filename for a sequence number.
o. A freeze/thaw API for Dave so that he can make multiple changes.
This
integrates with (#z) of course.
Yes... I find it improves performance hugely when the user highlights
all messages and hits "delete".
Dave.
--
Dave Cridland - mailto:dave cridland net - xmpp:dwd jabber org
- acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
- http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade
[
Date Prev][
Date Next] [
Thread Prev][
Thread Next]
[
Thread Index]
[
Date Index]
[
Author Index]