Re: Summary store, impl n. 7

From: Philip Van Hoof <spam pvanhoof be>
To: Dave Cridland <dave cridland net>
Cc: Jeffrey Stedfast <fejj novell com>, tinymail-devel-list gnome org
Subject: Re: Summary store, impl n. 7
Date: Tue, 01 Jan 2008 23:26:15 +0100
Interesting idea about "cc" and "to" addresses.

desrt you should use dvalue :)
pvanhoof And flag changes of items happen in a separate file (this is for email)
pvanhoof Nah, they are always strings. I don't think I need a generic type
joh desrt: Can't wait to see dconf finished :)
desrt pvanhoof; it's more for the convenience factor
desrt there's a really nice "write this to a binary file" API
pvanhoof But I'm sure following up on dvalue and dconf :)
desrt and "read this out of a binary file" API
* tbf (~mathias ip-80-226-0-1 vodafone-net de) has joined #gnome-hackers
desrt what is the file a mapping of, anyway?
desrt ID numbers to strings?
pvanhoof summary data, that's your subject, cc, to, from, uid, sequence number
pvanhoof And the flags of your email
pvanhoof or, the ENVELOPE in IMAP terminology
desrt so you want to store common "from" and "to" fields without duplication
pvanhoof aye
desrt cool
pvanhoof And most referred to grouped together
pvanhoof The goal is to have fewer 4k pages in real memory modules
desrt is the to a string or an array of strings?
desrt (think: multiple recipients)
pvanhoof And most of the less frequently used ones in virtual memory
pvanhoof I handle all of the summary items as just strings
desrt recogniser your to/cc strategy
pvanhoof To save pointers 
desrt er
desrt *reconsider
pvanhoof Yes, true. cc and to could be lists of strings to find even more duplicates
pvanhoof but a list would also mean, more pointers in my heap
desrt it's true.
desrt but chances are that a huge cc: list isn't going to be exactly the same every time around
desrt although you could improve your chances by sorting the list
desrt so at least permutations don't count as different entities
pvanhoof Agree, but chances that a user has a lot of huge cc lists are rather small (spammers don't do this a lot :p)
desrt :)
pvanhoof yep, I sort the strings on frequency used
desrt no no
pvanhoof so those unique strings will be at the tail of the file
desrt i mean if you have
desrt cc: pvanhoof gnome org, desrt desrt ca
tbf pvanhoof: you really don't know normal people....
desrt then later
desrt cc: desrt desrt ca, pvanhoof gnome org
* RealNitro has quit (Ex-Chat)
pvanhoof oh yes
desrt ^ these are the same, but would be stored separately
pvanhoof that would be two strings, four pointers .. 
desrt well
desrt using my method it would be
desrt using your method it would be 2 (long) strings, 2 pointers
dobey joh: store the "bookmarked" list in a different key
desrt if you sort the list and then use your method
pvanhoof right now this is two strings two pointers, but for larger samples it makes sense to store addresses separately yes
tbf pvanhoof: mails regarding my fencing club usually have quite large CC lists
desrt then it's 1 long string and 2 pointers
desrt you're still not listening to me :p
dobey joh: though i don't know how you're storing anything currently
pvanhoof I am I am :)
desrt i mean that you should take the _string_ "pvanhoof gnome org, desrt desrt ca" and convert it to the string "desrt desrt ca, pvanhoof gnome org"
pvanhoof ah yes
pvanhoof interesting
desrt since that way you avoid having lots of copies of what is essentially the same list
pvanhoof Just sort the string's substrings 
desrt exactly
pvanhoof that's an interesting idea indeed
desrt ok.  dinner time.  bye :)
lifeless fails when you have 2 and 3 element long strings with common elements :)


On Tue, 2008-01-01 at 23:08 +0100, Philip Van Hoof wrote:
> On Tue, 2008-01-01 at 20:31 +0000, Dave Cridland wrote:
> > On Thu Dec 27 19:53:03 2007, Philip Van Hoof wrote:
> 
> > > o. Stores flags in a separate file. Flags are the only type of data  
> > > that changes relatively often in the summary. Therefore a separate  
> > > file.
>  
> > Hmmm... FLAGS and MODSEQ. I'm not clear if you really need MODSEQ to  
> > be stored at all in the client, mind.
> 
> If the MODSEQ number is not often changed (if ever), it can be put in
> the mapped file. If it often changes, it can be put in the flags file as
> a new column there (like after the flags bitfield number).
> 
> Of course is it sizeof(int) more heap memory per item. But if this field
> is needed in your client, it can of course be added (or seek()ed for in
> the property's getter if it's not often needed, and can therefore be
> slow).
> 
> > > o. Stores "ftell()" offsets in a simplistic index file. Parsing this 
> > > 
> > Surely the offsets are fixed width, so a mmap() candidate?
> 
> Width fixed widths the strings you store are either too limited in their
> length (like, 50 characters) or too wide for the typical ones (like 250
> chars or more).
> 
> If you store strings uniquely and have some sort of index file anyway
> (like this experiment) then a fixed width for the strings themselves is
> not really needed I think. Just ftell() each time you write a string and
> update the index.
> 
> > > o. Has a way to expunge an item (and the sequence numbers of higher
> > >    items are adapted automatically)
> > >  ...
> > > 
> > And unsolicited FETCH FLAGS. Possibly others, too.
> 
> Yep, unsolicited FETCH FLAGS can be coped with with this experimental
> implementation (or will be once I coded writing the flags file after a
> flags update), EXPUNGE and VANISHED too. An EXISTS followed by a FETCH
> ENVELOPE and also an unsolicited NOTIFY's ENVELOPE can be coped with
> too.
> 
> Note that I want the flags file to be written less frequent than flags
> get set to items -> some way to delay the write of the file, because
> it's quite likely that subsequent flag changes take place in a short
> amount of time. Like a burst of unsolicited FETCH FLAGS or the user
> doing a "select all" and then "mark as read".
> 
> > > o. Has a way to get an item by sequence number quickly (hashtable
> > >    lookup)
>  
> > > 
> > Polymer uses two caches for this - one which is by UID, but contains  
> > no msgno. The other is essentially a sparse list of UIDs.
> 
> > Finding a sequence number from a UID is done by successive  
> > approximation - it assumes that the graph of msgno vs UID  
> > approximates a straight line, which astonishingly it does. It does a  
> > little more cleverness by quickly checking the currently loaded  
> > blocks, too, but that's the bulk of it.
> 
> Right, sounds sensible to optimise this indeed. Right now I just store
> two separate hashtables that have their values point to the same items.
> 
> It's indeed rather dumb and not very memory-efficient (a hashtable's
> node is quite big in size).
> 
> The reason for these simplistic hashtables's-method was that out of
> order inserting of data was easy to implement this way (just insert new
> items to the table).
> 
> 
> Mindnote:
> A draw-back from a GPtrArray with a hashtable is the foreach speed which
> will be needed when copying the item pointers to the GtkTreeModel's
> store if that one is GPtrArray based: a foreach over a GPtrArray is of
> course just a for-loop, a foreach over a hashtable is (like) a doubly
> linked list-loop (does more dereferencing which is afaik slower than
> than just doing the i++ math in the for-loop).
> 
> 
> > The effect this'd have would be to (usually) map the first and last  
> > blocks, and one other (the one the user was "looking" at). Quite  
> > often, all three are the same.
> 
> > So for tiny mailboxes - one or two blocks - it's a straightforward  
> > successive approximation search through the mmapped block containing  
> > the UID. (And that's easy to find).
> 
> > For medium mailboxes, there's the rather nice property that, in  
> > general, users get rid of mail fairly evenly throughout. (There's a  
> > gentle upward curve, actually, as spam increases).
> 
> Right.
> 
> > For massive mailboxes, these turn out to be mailing list archives,  
> > etc, where no expunges have happened at all, and the successive  
> > approximation technique finds the UID with one probe.
> 
> 
> > > o. Has a way to create new items easily. Deals with out-of-order  
> > > adding of items (when requesting an item with a sequence number that  
> > > doesn't yet exist, you'll simply get NULL)
> > > 
> > You probably want to be returning things other than NULL. You've got,  
> > at least:
> > 
> > - "This data cannot exist" - ie, the UID or msgno doesn't exist.
> > - "This data doesn't exist" - ie, the UID or msgno exists, but the  
> > message does not have this data.
> > - "I don't have this data" - ie, the UID or msgno exists, but the  
> > data hasn't been requested yet.
> 
> So some sort of status or error code that you get when requesting an
> item. Something like this perhaps?
> 
> typedef enum {
> 	notyetreceived,
> 	fine,
> 	nonexisting
> } SummaryItemState;
> 
> typedef struct {
> 	...
> } SummaryItem;
> 
> 
> Summary *s = ...
> SummaryItem *item = NULL;
> SummaryItemState state;
> 
> summary_item_by_uid (s, "0001", &item, &state);
> 
> if (!item && state == notyetreceived) {
> 	printf ("...\n");
> } else if (item && state == fine) {
> 	printf ("%s\n", summary_item_get_subject (item));
> }
> 
> > > o. Scales to > 200,000 items easily
> 
> > Cool. I'll be able to read my mail on Modest. It managed to  
> > synchronize my work account yesterday, after several weeks.
> 
> :-)
> 
> Modest is modestly aimed at supporting < ~30,000 items on Nokia's
> devices. I have a fairly good idea why more items make the application
> too slow (mostly a combination of GtkTreeView-style live sorting while
> retrieving takes place and meanwhile periodic writing of the summary's
> mmapped file).
> 
> Regretfully I don't have the funding to work on improving these problems
> nor are really large mailboxes a very high priority for Modest atm. This
> experimental summary implementation is a first step to trying to solve
> one of the two problems Modest right now has, with very large mailboxes.
> 
> The demoui or TMut can cope with ~ 70,000 items on a N800 device, by the
> way. Just don't start sorting things.
> 
> Obviously must Modest do sorting for the user. An end-user doesn't
> really care a lot about why a sorted summary makes his mobile E-mail
> client not cope with insanely large mailboxes. Except if he has one,
> like you have :D, of course.
> 
> (Should be fairly easy to mock with Modest's source and shortcut te
> GtkTreeModelSortable stuff and just directly assign as the treeview's
> model the TnyGtkHeaderListModel -- search for that type --).
> 
> > > o. Should be relatively thread safe (at least some attempts at  
> > > making it
> > >    that way are already in place). (although I'd love to use  
> > > lock-free
> > >    hashtables, I simply used mutexes for this)
> > > 
> > > 
> > I wouldn't get stressed about this. Instead, I'd suggest you get rid  
> > of whatever inane threading is going on, and/or supply a wrapper API  
> > which marshals requests to the single thread handling the cache and  
> > protocol.
> 
> Regretfully, though, you need to access the summary from both the UI
> thread and from the thread that is receiving stuff.
> 
> You can't really marshal the requests coming from the UI thread unless
> you want to make things hideously slow for the user interface. You can,
> however, indeed, marshal the requests (to store new stuff) coming from
> the thread that is receiving new material (you queue it).
> 
> Queueing for that thread is something I think the application itself (or
> in my case Tinymail, for that matter) should take care of, no? That's my
> current idea.
> 
> -> Direct access to the summary from the UI's thread
> -> Queued access to the summary from the data receiving thread, perhaps
>    queued in the UI's thread (which wouldn't require thread safety
>    anymore then)
> 
> > > o. A freeze/thaw API for Dave so that he can make multiple changes.  
> > > This integrates with (#z) of course.
> > 
> > Yes... I find it improves performance hugely when the user highlights  
> > all messages and hits "delete".
> 
> That should result in saving three empty files once the freeze API is
> called for. Will be fast, indeed.
> 
> 
> -- 
> Philip Van Hoof, freelance software developer
> home: me at pvanhoof dot be 
> gnome: pvanhoof at gnome dot org 
> http://pvanhoof.be/blog
> http://codeminded.be
> 
> _______________________________________________
> tinymail-devel-list mailing list
> tinymail-devel-list gnome org
> http://mail.gnome.org/mailman/listinfo/tinymail-devel-list
-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be
Follow-Ups:
- Re: Summary store, impl n. 8
  - From: Philip Van Hoof
References:
- Re: Summary store, impl n. 7
  - From: Dave Cridland
- Re: Summary store, impl n. 7
  - From: Philip Van Hoof
[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]