Re: Summary store, impl n. 7

From: Philip Van Hoof <spam pvanhoof be>
To: Dave Cridland <dave cridland net>
Cc: Jeffrey Stedfast <fejj novell com>, tinymail-devel-list gnome org
Subject: Re: Summary store, impl n. 7
Date: Tue, 01 Jan 2008 23:08:12 +0100
On Tue, 2008-01-01 at 20:31 +0000, Dave Cridland wrote:
> On Thu Dec 27 19:53:03 2007, Philip Van Hoof wrote:

> > o. Stores flags in a separate file. Flags are the only type of data  
> > that changes relatively often in the summary. Therefore a separate  
> > file.
 
> Hmmm... FLAGS and MODSEQ. I'm not clear if you really need MODSEQ to  
> be stored at all in the client, mind.

If the MODSEQ number is not often changed (if ever), it can be put in
the mapped file. If it often changes, it can be put in the flags file as
a new column there (like after the flags bitfield number).

Of course is it sizeof(int) more heap memory per item. But if this field
is needed in your client, it can of course be added (or seek()ed for in
the property's getter if it's not often needed, and can therefore be
slow).

> > o. Stores "ftell()" offsets in a simplistic index file. Parsing this 
> > 
> Surely the offsets are fixed width, so a mmap() candidate?

Width fixed widths the strings you store are either too limited in their
length (like, 50 characters) or too wide for the typical ones (like 250
chars or more).

If you store strings uniquely and have some sort of index file anyway
(like this experiment) then a fixed width for the strings themselves is
not really needed I think. Just ftell() each time you write a string and
update the index.

> > o. Has a way to expunge an item (and the sequence numbers of higher
> >    items are adapted automatically)
> >  ...
> > 
> And unsolicited FETCH FLAGS. Possibly others, too.

Yep, unsolicited FETCH FLAGS can be coped with with this experimental
implementation (or will be once I coded writing the flags file after a
flags update), EXPUNGE and VANISHED too. An EXISTS followed by a FETCH
ENVELOPE and also an unsolicited NOTIFY's ENVELOPE can be coped with
too.

Note that I want the flags file to be written less frequent than flags
get set to items -> some way to delay the write of the file, because
it's quite likely that subsequent flag changes take place in a short
amount of time. Like a burst of unsolicited FETCH FLAGS or the user
doing a "select all" and then "mark as read".

> > o. Has a way to get an item by sequence number quickly (hashtable
> >    lookup)
 
> > 
> Polymer uses two caches for this - one which is by UID, but contains  
> no msgno. The other is essentially a sparse list of UIDs.

> Finding a sequence number from a UID is done by successive  
> approximation - it assumes that the graph of msgno vs UID  
> approximates a straight line, which astonishingly it does. It does a  
> little more cleverness by quickly checking the currently loaded  
> blocks, too, but that's the bulk of it.

Right, sounds sensible to optimise this indeed. Right now I just store
two separate hashtables that have their values point to the same items.

It's indeed rather dumb and not very memory-efficient (a hashtable's
node is quite big in size).

The reason for these simplistic hashtables's-method was that out of
order inserting of data was easy to implement this way (just insert new
items to the table).


Mindnote:
A draw-back from a GPtrArray with a hashtable is the foreach speed which
will be needed when copying the item pointers to the GtkTreeModel's
store if that one is GPtrArray based: a foreach over a GPtrArray is of
course just a for-loop, a foreach over a hashtable is (like) a doubly
linked list-loop (does more dereferencing which is afaik slower than
than just doing the i++ math in the for-loop).


> The effect this'd have would be to (usually) map the first and last  
> blocks, and one other (the one the user was "looking" at). Quite  
> often, all three are the same.

> So for tiny mailboxes - one or two blocks - it's a straightforward  
> successive approximation search through the mmapped block containing  
> the UID. (And that's easy to find).

> For medium mailboxes, there's the rather nice property that, in  
> general, users get rid of mail fairly evenly throughout. (There's a  
> gentle upward curve, actually, as spam increases).

Right.

> For massive mailboxes, these turn out to be mailing list archives,  
> etc, where no expunges have happened at all, and the successive  
> approximation technique finds the UID with one probe.


> > o. Has a way to create new items easily. Deals with out-of-order  
> > adding of items (when requesting an item with a sequence number that  
> > doesn't yet exist, you'll simply get NULL)
> > 
> You probably want to be returning things other than NULL. You've got,  
> at least:
> 
> - "This data cannot exist" - ie, the UID or msgno doesn't exist.
> - "This data doesn't exist" - ie, the UID or msgno exists, but the  
> message does not have this data.
> - "I don't have this data" - ie, the UID or msgno exists, but the  
> data hasn't been requested yet.

So some sort of status or error code that you get when requesting an
item. Something like this perhaps?

typedef enum {
	notyetreceived,
	fine,
	nonexisting
} SummaryItemState;

typedef struct {
	...
} SummaryItem;


Summary *s = ...
SummaryItem *item = NULL;
SummaryItemState state;

summary_item_by_uid (s, "0001", &item, &state);

if (!item && state == notyetreceived) {
	printf ("...\n");
} else if (item && state == fine) {
	printf ("%s\n", summary_item_get_subject (item));
}

> > o. Scales to > 200,000 items easily

> Cool. I'll be able to read my mail on Modest. It managed to  
> synchronize my work account yesterday, after several weeks.

:-)

Modest is modestly aimed at supporting < ~30,000 items on Nokia's
devices. I have a fairly good idea why more items make the application
too slow (mostly a combination of GtkTreeView-style live sorting while
retrieving takes place and meanwhile periodic writing of the summary's
mmapped file).

Regretfully I don't have the funding to work on improving these problems
nor are really large mailboxes a very high priority for Modest atm. This
experimental summary implementation is a first step to trying to solve
one of the two problems Modest right now has, with very large mailboxes.

The demoui or TMut can cope with ~ 70,000 items on a N800 device, by the
way. Just don't start sorting things.

Obviously must Modest do sorting for the user. An end-user doesn't
really care a lot about why a sorted summary makes his mobile E-mail
client not cope with insanely large mailboxes. Except if he has one,
like you have :D, of course.

(Should be fairly easy to mock with Modest's source and shortcut te
GtkTreeModelSortable stuff and just directly assign as the treeview's
model the TnyGtkHeaderListModel -- search for that type --).

> > o. Should be relatively thread safe (at least some attempts at  
> > making it
> >    that way are already in place). (although I'd love to use  
> > lock-free
> >    hashtables, I simply used mutexes for this)
> > 
> > 
> I wouldn't get stressed about this. Instead, I'd suggest you get rid  
> of whatever inane threading is going on, and/or supply a wrapper API  
> which marshals requests to the single thread handling the cache and  
> protocol.

Regretfully, though, you need to access the summary from both the UI
thread and from the thread that is receiving stuff.

You can't really marshal the requests coming from the UI thread unless
you want to make things hideously slow for the user interface. You can,
however, indeed, marshal the requests (to store new stuff) coming from
the thread that is receiving new material (you queue it).

Queueing for that thread is something I think the application itself (or
in my case Tinymail, for that matter) should take care of, no? That's my
current idea.

-> Direct access to the summary from the UI's thread
-> Queued access to the summary from the data receiving thread, perhaps
   queued in the UI's thread (which wouldn't require thread safety
   anymore then)

> > o. A freeze/thaw API for Dave so that he can make multiple changes.  
> > This integrates with (#z) of course.
> 
> Yes... I find it improves performance hugely when the user highlights  
> all messages and hits "delete".

That should result in saving three empty files once the freeze API is
called for. Will be fast, indeed.


-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be
Follow-Ups:
- Re: Summary store, impl n. 7
  - From: Philip Van Hoof
References:
- Re: Summary store, impl n. 7
  - From: Dave Cridland
[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]