Re: Summary store, impl n. 8

From: Philip Van Hoof <spam pvanhoof be>
To: Dave Cridland <dave cridland net>
Cc: Jeffrey Stedfast <fejj novell com>, tinymail-devel-list gnome org
Subject: Re: Summary store, impl n. 8
Date: Fri, 04 Jan 2008 00:19:31 +0100
On Tue, 2008-01-01 at 23:26 +0100, Philip Van Hoof wrote:
> Interesting idea about "cc" and "to" addresses.

Experimentally implemented. It's the caller who'll create SummaryItems
who'll have to provide sorted addresslists by modifying his ENVELOPE
parser in such a way that convenient strings are passed.

I have added to code internally to show how it would work.

The example has the to and cc in reversed order, if you make the (just
add one (any) argument to the cmdline when running) and open the
data_0.mmap file, you'll notice that only one string is stored for both
cc and to (the example reuses all strings 50,000 times, but it's of
course an sample with extremely redundant data).



>  dvalue :)
> pvanhoof And flag changes of items happen in a separate file (this is for email)
> pvanhoof Nah, they are always strings. I don't think I need a generic type
> joh desrt: Can't wait to see dconf finished :)
> desrt pvanhoof; it's more for the convenience factor
> desrt there's a really nice "write this to a binary file" API
> pvanhoof But I'm sure following up on dvalue and dconf :)
> desrt and "read this out of a binary file" API
> * tbf (~mathias ip-80-226-0-1 vodafone-net de) has joined #gnome-hackers
> desrt what is the file a mapping of, anyway?
> desrt ID numbers to strings?
> pvanhoof summary data, that's your subject, cc, to, from, uid, sequence number
> pvanhoof And the flags of your email
> pvanhoof or, the ENVELOPE in IMAP terminology
> desrt so you want to store common "from" and "to" fields without duplication
> pvanhoof aye
> desrt cool
> pvanhoof And most referred to grouped together
> pvanhoof The goal is to have fewer 4k pages in real memory modules
> desrt is the to a string or an array of strings?
> desrt (think: multiple recipients)
> pvanhoof And most of the less frequently used ones in virtual memory
> pvanhoof I handle all of the summary items as just strings
> desrt recogniser your to/cc strategy
> pvanhoof To save pointers 
> desrt er
> desrt *reconsider
> pvanhoof Yes, true. cc and to could be lists of strings to find even more duplicates
> pvanhoof but a list would also mean, more pointers in my heap
> desrt it's true.
> desrt but chances are that a huge cc: list isn't going to be exactly the same every time around
> desrt although you could improve your chances by sorting the list
> desrt so at least permutations don't count as different entities
> pvanhoof Agree, but chances that a user has a lot of huge cc lists are rather small (spammers don't do this a lot :p)
> desrt :)
> pvanhoof yep, I sort the strings on frequency used
> desrt no no
> pvanhoof so those unique strings will be at the tail of the file
> desrt i mean if you have
> desrt cc: pvanhoof gnome org, desrt desrt ca
> tbf pvanhoof: you really don't know normal people....
> desrt then later
> desrt cc: desrt desrt ca, pvanhoof gnome org
> * RealNitro has quit (Ex-Chat)
> pvanhoof oh yes
> desrt ^ these are the same, but would be stored separately
> pvanhoof that would be two strings, four pointers .. 
> desrt well
> desrt using my method it would be
> desrt using your method it would be 2 (long) strings, 2 pointers
> dobey joh: store the "bookmarked" list in a different key
> desrt if you sort the list and then use your method
> pvanhoof right now this is two strings two pointers, but for larger samples it makes sense to store addresses separately yes
> tbf pvanhoof: mails regarding my fencing club usually have quite large CC lists
> desrt then it's 1 long string and 2 pointers
> desrt you're still not listening to me :p
> dobey joh: though i don't know how you're storing anything currently
> pvanhoof I am I am :)
> desrt i mean that you should take the _string_ "pvanhoof gnome org, desrt desrt ca" and convert it to the string "desrt desrt ca, pvanhoof gnome org"
> pvanhoof ah yes
> pvanhoof interesting
> desrt since that way you avoid having lots of copies of what is essentially the same list
> pvanhoof Just sort the string's substrings 
> desrt exactly
> pvanhoof that's an interesting idea indeed
> desrt ok.  dinner time.  bye :)
> lifeless fails when you have 2 and 3 element long strings with common elements :)
> 
> 
> On Tue, 2008-01-01 at 23:08 +0100, Philip Van Hoof wrote:
> > On Tue, 2008-01-01 at 20:31 +0000, Dave Cridland wrote:
> > > On Thu Dec 27 19:53:03 2007, Philip Van Hoof wrote:
> > 
> > > > o. Stores flags in a separate file. Flags are the only type of data  
> > > > that changes relatively often in the summary. Therefore a separate  
> > > > file.
> >  
> > > Hmmm... FLAGS and MODSEQ. I'm not clear if you really need MODSEQ to  
> > > be stored at all in the client, mind.
> > 
> > If the MODSEQ number is not often changed (if ever), it can be put in
> > the mapped file. If it often changes, it can be put in the flags file as
> > a new column there (like after the flags bitfield number).
> > 
> > Of course is it sizeof(int) more heap memory per item. But if this field
> > is needed in your client, it can of course be added (or seek()ed for in
> > the property's getter if it's not often needed, and can therefore be
> > slow).
> > 
> > > > o. Stores "ftell()" offsets in a simplistic index file. Parsing this 
> > > > 
> > > Surely the offsets are fixed width, so a mmap() candidate?
> > 
> > Width fixed widths the strings you store are either too limited in their
> > length (like, 50 characters) or too wide for the typical ones (like 250
> > chars or more).
> > 
> > If you store strings uniquely and have some sort of index file anyway
> > (like this experiment) then a fixed width for the strings themselves is
> > not really needed I think. Just ftell() each time you write a string and
> > update the index.
> > 
> > > > o. Has a way to expunge an item (and the sequence numbers of higher
> > > >    items are adapted automatically)
> > > >  ...
> > > > 
> > > And unsolicited FETCH FLAGS. Possibly others, too.
> > 
> > Yep, unsolicited FETCH FLAGS can be coped with with this experimental
> > implementation (or will be once I coded writing the flags file after a
> > flags update), EXPUNGE and VANISHED too. An EXISTS followed by a FETCH
> > ENVELOPE and also an unsolicited NOTIFY's ENVELOPE can be coped with
> > too.
> > 
> > Note that I want the flags file to be written less frequent than flags
> > get set to items -> some way to delay the write of the file, because
> > it's quite likely that subsequent flag changes take place in a short
> > amount of time. Like a burst of unsolicited FETCH FLAGS or the user
> > doing a "select all" and then "mark as read".
> > 
> > > > o. Has a way to get an item by sequence number quickly (hashtable
> > > >    lookup)
> >  
> > > > 
> > > Polymer uses two caches for this - one which is by UID, but contains  
> > > no msgno. The other is essentially a sparse list of UIDs.
> > 
> > > Finding a sequence number from a UID is done by successive  
> > > approximation - it assumes that the graph of msgno vs UID  
> > > approximates a straight line, which astonishingly it does. It does a  
> > > little more cleverness by quickly checking the currently loaded  
> > > blocks, too, but that's the bulk of it.
> > 
> > Right, sounds sensible to optimise this indeed. Right now I just store
> > two separate hashtables that have their values point to the same items.
> > 
> > It's indeed rather dumb and not very memory-efficient (a hashtable's
> > node is quite big in size).
> > 
> > The reason for these simplistic hashtables's-method was that out of
> > order inserting of data was easy to implement this way (just insert new
> > items to the table).
> > 
> > 
> > Mindnote:
> > A draw-back from a GPtrArray with a hashtable is the foreach speed which
> > will be needed when copying the item pointers to the GtkTreeModel's
> > store if that one is GPtrArray based: a foreach over a GPtrArray is of
> > course just a for-loop, a foreach over a hashtable is (like) a doubly
> > linked list-loop (does more dereferencing which is afaik slower than
> > than just doing the i++ math in the for-loop).
> > 
> > 
> > > The effect this'd have would be to (usually) map the first and last  
> > > blocks, and one other (the one the user was "looking" at). Quite  
> > > often, all three are the same.
> > 
> > > So for tiny mailboxes - one or two blocks - it's a straightforward  
> > > successive approximation search through the mmapped block containing  
> > > the UID. (And that's easy to find).
> > 
> > > For medium mailboxes, there's the rather nice property that, in  
> > > general, users get rid of mail fairly evenly throughout. (There's a  
> > > gentle upward curve, actually, as spam increases).
> > 
> > Right.
> > 
> > > For massive mailboxes, these turn out to be mailing list archives,  
> > > etc, where no expunges have happened at all, and the successive  
> > > approximation technique finds the UID with one probe.
> > 
> > 
> > > > o. Has a way to create new items easily. Deals with out-of-order  
> > > > adding of items (when requesting an item with a sequence number that  
> > > > doesn't yet exist, you'll simply get NULL)
> > > > 
> > > You probably want to be returning things other than NULL. You've got,  
> > > at least:
> > > 
> > > - "This data cannot exist" - ie, the UID or msgno doesn't exist.
> > > - "This data doesn't exist" - ie, the UID or msgno exists, but the  
> > > message does not have this data.
> > > - "I don't have this data" - ie, the UID or msgno exists, but the  
> > > data hasn't been requested yet.
> > 
> > So some sort of status or error code that you get when requesting an
> > item. Something like this perhaps?
> > 
> > typedef enum {
> > 	notyetreceived,
> > 	fine,
> > 	nonexisting
> > } SummaryItemState;
> > 
> > typedef struct {
> > 	...
> > } SummaryItem;
> > 
> > 
> > Summary *s = ...
> > SummaryItem *item = NULL;
> > SummaryItemState state;
> > 
> > summary_item_by_uid (s, "0001", &item, &state);
> > 
> > if (!item && state == notyetreceived) {
> > 	printf ("...\n");
> > } else if (item && state == fine) {
> > 	printf ("%s\n", summary_item_get_subject (item));
> > }
> > 
> > > > o. Scales to > 200,000 items easily
> > 
> > > Cool. I'll be able to read my mail on Modest. It managed to  
> > > synchronize my work account yesterday, after several weeks.
> > 
> > :-)
> > 
> > Modest is modestly aimed at supporting < ~30,000 items on Nokia's
> > devices. I have a fairly good idea why more items make the application
> > too slow (mostly a combination of GtkTreeView-style live sorting while
> > retrieving takes place and meanwhile periodic writing of the summary's
> > mmapped file).
> > 
> > Regretfully I don't have the funding to work on improving these problems
> > nor are really large mailboxes a very high priority for Modest atm. This
> > experimental summary implementation is a first step to trying to solve
> > one of the two problems Modest right now has, with very large mailboxes.
> > 
> > The demoui or TMut can cope with ~ 70,000 items on a N800 device, by the
> > way. Just don't start sorting things.
> > 
> > Obviously must Modest do sorting for the user. An end-user doesn't
> > really care a lot about why a sorted summary makes his mobile E-mail
> > client not cope with insanely large mailboxes. Except if he has one,
> > like you have :D, of course.
> > 
> > (Should be fairly easy to mock with Modest's source and shortcut te
> > GtkTreeModelSortable stuff and just directly assign as the treeview's
> > model the TnyGtkHeaderListModel -- search for that type --).
> > 
> > > > o. Should be relatively thread safe (at least some attempts at  
> > > > making it
> > > >    that way are already in place). (although I'd love to use  
> > > > lock-free
> > > >    hashtables, I simply used mutexes for this)
> > > > 
> > > > 
> > > I wouldn't get stressed about this. Instead, I'd suggest you get rid  
> > > of whatever inane threading is going on, and/or supply a wrapper API  
> > > which marshals requests to the single thread handling the cache and  
> > > protocol.
> > 
> > Regretfully, though, you need to access the summary from both the UI
> > thread and from the thread that is receiving stuff.
> > 
> > You can't really marshal the requests coming from the UI thread unless
> > you want to make things hideously slow for the user interface. You can,
> > however, indeed, marshal the requests (to store new stuff) coming from
> > the thread that is receiving new material (you queue it).
> > 
> > Queueing for that thread is something I think the application itself (or
> > in my case Tinymail, for that matter) should take care of, no? That's my
> > current idea.
> > 
> > -> Direct access to the summary from the UI's thread
> > -> Queued access to the summary from the data receiving thread, perhaps
> >    queued in the UI's thread (which wouldn't require thread safety
> >    anymore then)
> > 
> > > > o. A freeze/thaw API for Dave so that he can make multiple changes.  
> > > > This integrates with (#z) of course.
> > > 
> > > Yes... I find it improves performance hugely when the user highlights  
> > > all messages and hits "delete".
> > 
> > That should result in saving three empty files once the freeze API is
> > called for. Will be fast, indeed.
> > 
> > 
> > -- 
> > Philip Van Hoof, freelance software developer
> > home: me at pvanhoof dot be 
> > gnome: pvanhoof at gnome dot org 
> > http://pvanhoof.be/blog
> > http://codeminded.be
> > 
> > _______________________________________________
> > tinymail-devel-list mailing list
> > tinymail-devel-list gnome org
> > http://mail.gnome.org/mailman/listinfo/tinymail-devel-list
-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be
Attachment: mytest8.tar.gz
Description: application/compressed-tar
Follow-Ups:
- Re: Summary store, impl n. 8
  - From: Philip Van Hoof
References:
- Re: Summary store, impl n. 7
  - From: Dave Cridland
- Re: Summary store, impl n. 7
  - From: Philip Van Hoof
- Re: Summary store, impl n. 7
  - From: Philip Van Hoof
[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]