Re: Peeping at the future

From: Philip Van Hoof <spam pvanhoof be>
To: tinymail-devel-list gnome org
Cc: Dave Cridland <dave cridland net>
Subject: Re: Peeping at the future
Date: Fri, 07 Dec 2007 01:23:29 +0100
Reworked version that has the hashtables for looking up the items by
either uid or sequence.
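
For readers who do not want to unpack the tarball first, the idea of the two lookup tables can be sketched roughly as below. This is only an illustration in Python; the names SummaryItem and ItemLookup are made up for the sketch and are not taken from the attached code.

```python
from dataclasses import dataclass


@dataclass
class SummaryItem:
    # Hypothetical stand-in for one summary record
    uid: str
    seq: int
    size: int


class ItemLookup:
    """Keeps two hash tables over the same items: one by uid, one by sequence."""

    def __init__(self):
        self.by_uid = {}
        self.by_seq = {}

    def add(self, item):
        # Both tables point at the same object, so nothing is copied
        self.by_uid[item.uid] = item
        self.by_seq[item.seq] = item

    def remove(self, item):
        # Keep the two tables in sync when an item goes away
        self.by_uid.pop(item.uid, None)
        self.by_seq.pop(item.seq, None)


lookup = ItemLookup()
lookup.add(SummaryItem(uid="uid0", seq=10, size=2048))
```

Either key then leads to the same item: lookup.by_uid["uid0"] and lookup.by_seq[10] are one object.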

On Thu, 2007-12-06 at 23:51 +0100, Philip Van Hoof wrote:
> Hi there fellow hackers,
> 
> This is some code that I have lying around, that will someday replace
> the summary storage. Probably a few weeks or days after Tinymail 1.0
> gets released, in Tinymail 2.0's then-new branch.
> 
> I kindly invite all the crazy people to check it out, investigate,
> comment what could be better, etc.
> 
> I'll make a quick guide.
> 
> First let's repeat the story of the summary:
> 
>  o. The summary is the part of a folder that you want to see when
>     requesting an overview. This means: to, from, subject, cc, flags,
>     size, uid.
> 
>  o. Because this data is quantitatively large, it consumes most of your
>     E-mail client's memory, unless you are smart. Tinymail tries to be
>     smart by mmapping this data.
> 
>  o. This data is read often, changes seldom, and has a lot of duplicate
>     strings (really a lot). When it changes, the change is an append,
>     a delete or a flag change. Once appended, an item never changes
>     other than by a flag change or by being deleted.
> 
>  o. Some numbers to give you an idea:
> 
> 	o. 30,000 items consume on average 10 MB of mmapped data (strings)
> 	o. 6 MB admin (pointers)
> 	o. If not using GStringChunk, add 2 MB heap admin to this
> 	o. Evolution triples these numbers (if not more)
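
A rough back-of-the-envelope reading of those numbers, as a Python sketch. It assumes decimal megabytes (1 MB = 1,000,000 bytes), which the figures above do not specify.

```python
ITEMS = 30_000
MB = 1_000_000  # assumption: decimal megabytes

STRINGS_MB = 10  # mmapped string data
ADMIN_MB = 6     # pointers
HEAP_MB = 2      # extra heap admin when not using GStringChunk


def bytes_per_item(megabytes):
    """Average number of bytes one summary item accounts for."""
    return megabytes * MB / ITEMS


strings_per_item = bytes_per_item(STRINGS_MB)  # roughly 333 bytes
admin_per_item = bytes_per_item(ADMIN_MB)      # 200 bytes
total_mb = STRINGS_MB + ADMIN_MB               # with GStringChunk
total_without_chunks_mb = total_mb + HEAP_MB   # without GStringChunk
```

So each item costs a few hundred bytes of strings plus a couple of hundred bytes of pointers.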
> 
> Then, let's discuss the requirements, problems, details, ideas:
> 
>  o. The core idea is locality of memory (and mmap) data
> 
> 	o. Mmap is fine and all, but if your data is spread around then
> 	   the kernel must map many more pages into physical RAM.
> 
> 	   By putting the most referenced strings close together at the
> 	   beginning of the file, we make the kernel load fewer pages.
> 
> 	   The aim of this is to reduce VmRSS size.
> 
> 	o. Only unique strings are stored, saving disk space and
> 	   therefore also mmap size, and with it VM size.
> 
> 	   The aim of this is to reduce the VmSize.
> 
> 	o. Fewer pages that need to be accessed means fewer disk seeks.
> 
> 	o. Fewer pages (in ram) that need to be accessed means fewer
> 	   operations on the databus (mostly interesting for mobiles)
> 
>   o. We'll need fewer writes of the summary data
> 
> 	o. Right now rewriting the summary.mmap *IS* what makes Tinymail
> 	   slow when fetching a large folder (larger than 15,000 items,
> 	   you'll notice this). The solution is to work in blocks
> 	   instead.
> 
> 	o. Blocks (in this experiment code) are sized at 1000 items.
> 	   Rewriting a block this small will always be fast, even on
> 	   slow devices.
> 
> 	o. The flags are put in a separate flat sequential file
> 
> 	o. Wipes just get marked. When a lot of items in a block are
> 	   wiped, a rewrite of that block is scheduled (the only drastic
> 	   rewrite occasion). A wipe is an expunge or vanish that got
> 	   locally synced.
> 
> 	o. An append means that a new block is created, in appending
> 	   mode (for the new items that got added).
> 
>   o. Searches don't consume the memory and the mmap for an entire folder
> 
> 	o. Because of the blocks, the summary items that a search returns
> 	   can hold references on their block only, instead of needing
> 	   to keep a reference on the entire folder's summary mmap.
> 
> 	   This makes it possible to do modest searches. Each hit keeps
> 	   at least one block of 1000 items loaded. If multiple hits
> 	   occur in one block, it's just one block in memory with
> 	   multiple references.
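
The block handling described in the "fewer writes" item (appends create new blocks, wipes are only marked, a block rewrite is scheduled once many items are wiped) could be sketched like this in Python. The class names and the 50% rewrite threshold are assumptions made for the illustration; the text above only says "a lot of items".

```python
BLOCK_SIZE = 1000        # items per block, as in the experiment code
REWRITE_THRESHOLD = 0.5  # assumption: "a lot" means half of the block


class Block:
    def __init__(self, seqs):
        self.seqs = list(seqs)  # sequence numbers stand in for full records
        self.wiped = set()
        self.rewrite_scheduled = False

    def wipe(self, seq):
        # A wipe is only marked; nothing is rewritten yet
        self.wiped.add(seq)
        if len(self.wiped) >= REWRITE_THRESHOLD * len(self.seqs):
            self.rewrite_scheduled = True

    def rewrite(self):
        # The one drastic operation: drop the wiped items from this block
        self.seqs = [s for s in self.seqs if s not in self.wiped]
        self.wiped.clear()
        self.rewrite_scheduled = False


class BlockedSummary:
    def __init__(self):
        self.blocks = []

    def append(self, new_seqs):
        # Appended items always go into new blocks; old blocks stay untouched
        new_seqs = list(new_seqs)
        for start in range(0, len(new_seqs), BLOCK_SIZE):
            self.blocks.append(Block(new_seqs[start:start + BLOCK_SIZE]))

    def wipe(self, seq):
        for block in self.blocks:
            if seq in block.seqs and seq not in block.wiped:
                block.wipe(seq)
                return True
        return False

    def run_scheduled_rewrites(self):
        rewritten = 0
        for block in self.blocks:
            if block.rewrite_scheduled:
                block.rewrite()
                rewritten += 1
        return rewritten
```

The point of the design is that any rewrite touches at most one block of 1000 items, never the whole folder.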
> 
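
For the search item, the per-block references could look roughly like the Python sketch below. The "loaded" flag stands in for mapping and unmapping a block's data file. The names, and the assumption that block i holds sequence numbers i*1000+1 up to (i+1)*1000, are made up for the illustration.

```python
BLOCK_SIZE = 1000


class Block:
    def __init__(self, block_id):
        self.block_id = block_id
        self.refcount = 0
        self.loaded = False

    def ref(self):
        if self.refcount == 0:
            self.loaded = True   # stand-in for mmap()ing this block's data
        self.refcount += 1

    def unref(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.loaded = False  # stand-in for munmap()


class Hit:
    """A search result that keeps only its own block alive."""

    def __init__(self, block, seq):
        self.block = block
        self.seq = seq
        block.ref()

    def release(self):
        self.block.unref()


def search(blocks, matching_seqs):
    # Assumption: block i holds seqs i*1000+1 .. (i+1)*1000
    return [Hit(blocks[(seq - 1) // BLOCK_SIZE], seq) for seq in matching_seqs]


blocks = [Block(i) for i in range(30)]  # a 30,000 item folder
hits = search(blocks, [5, 17, 2500])
loaded_ids = [b.block_id for b in blocks if b.loaded]
```

Three hits in a 30,000 item folder keep two blocks loaded here; the first block simply carries two references.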
> 
> The solution: a three-file one.
> 
> Per block you have:
>   o. An index
>   o. A flags data file
>   o. An mmap file (the data file with the strings)
> 
> The index contains records like:
> 
> 4 uid0 10 2048 94 88 84 80
> 
> This means: 
>  o. The uid is 4 bytes long
>  o. The 4 bytes of the uid itself ("uid0")
>  o. The sequence number is 10
>  o. The size of the E-mail is 2048 octets
>  o. The subject is at offset 94
>  o. The from is at offset 88
>  o. The to is at offset 84
>  o. The cc is at offset 80
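
Reading such a record back could be sketched as follows in Python. This assumes the record is plain text with single spaces between the fields, exactly as printed above; the real on-disk encoding may differ, and the names IndexRecord and parse_index_record are made up for the sketch.

```python
from typing import NamedTuple


class IndexRecord(NamedTuple):
    uid: str
    seq: int
    size: int
    subject_off: int
    from_off: int
    to_off: int
    cc_off: int


def parse_index_record(line):
    # The first field is the length of the uid, so the uid is cut out by
    # length rather than by looking for the next space
    uid_len_text, rest = line.split(" ", 1)
    uid_len = int(uid_len_text)
    uid = rest[:uid_len]
    fields = rest[uid_len:].split()
    if len(fields) != 6:
        raise ValueError("expected 6 numeric fields after the uid")
    seq, size, subject_off, from_off, to_off, cc_off = (int(f) for f in fields)
    return IndexRecord(uid, seq, size, subject_off, from_off, to_off, cc_off)


record = parse_index_record("4 uid0 10 2048 94 88 84 80")
```

The four offsets in the record point into the block's data file, described further down.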
> 
> The flags data file contains records like:
> 
> 10 18910
> 
> This means:
> 
> Message with sequence number 10 has flag = 18910
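
Loading the flags file into a table keyed on sequence number could be sketched like this in Python. It assumes one text record per line, as in the example above; the function names are made up for the sketch, and what the individual bits of the flag value mean is left out.

```python
def parse_flags_record(line):
    # "10 18910" -> (10, 18910)
    seq_text, flags_text = line.split()
    return int(seq_text), int(flags_text)


def load_flags(lines):
    """Build a sequence number -> flags table from the flags file's records."""
    flags = {}
    for line in lines:
        if line.strip():
            seq, value = parse_flags_record(line)
            flags[seq] = value
    return flags


flags = load_flags(["10 18910", "11 0"])
```

Because the flags live in their own file, a flag change never has to touch the index or the string data.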
> 
> The data file has \0 delimited strings. The nice thing about this file
> is that the strings that get used most are put at the front of the file
> (the file is sorted on usage). The index file's offsets are the number
> of bytes since the start of this data file.
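
Building such a data file could be sketched as below in Python: count how often each string is used, store every unique string once, most used first, and remember the byte offset of each. The function names and the alphabetical tie-break are assumptions made for the sketch.

```python
from collections import Counter


def build_data_file(strings):
    """Return (data, offsets): \\0-terminated unique strings, most used first."""
    usage = Counter(strings)
    # Most used first; ties broken alphabetically so the result is stable
    ordered = sorted(usage, key=lambda s: (-usage[s], s))
    data = bytearray()
    offsets = {}
    for s in ordered:
        offsets[s] = len(data)  # byte offset from the start of the file
        data += s.encode("utf-8") + b"\0"
    return bytes(data), offsets


def read_string(data, offset):
    # An index offset points at the first byte; the string runs to the next \0
    end = data.index(b"\0", offset)
    return data[offset:end].decode("utf-8")


used = [
    "alice@example.org", "bob@example.org", "alice@example.org",
    "Hello", "alice@example.org", "bob@example.org",
]
data, offsets = build_data_file(used)
```

Six string uses end up as three stored strings, with the most referenced address at offset 0.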
> 
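
To see why the sort on usage matters for the locality argument earlier in this mail, one can count which pages a set of string offsets lands on. A small Python sketch, assuming 4096-byte pages and ignoring strings that straddle a page boundary; the offsets are made-up examples.

```python
PAGE_SIZE = 4096  # assumption: a common page size


def pages_touched(offsets, page_size=PAGE_SIZE):
    """Set of page numbers that the given byte offsets fall into."""
    return {offset // page_size for offset in offsets}


# Hot strings packed at the front of the data file
clustered = pages_touched([0, 18, 34, 120])

# The same number of strings spread around the file
scattered = pages_touched([0, 5000, 10000, 20000])
```

The clustered offsets all fall in one page, while the scattered ones need four pages to be resident.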
> 
> Have fun reading code ...
> 
> 
> -- 
> Philip Van Hoof, freelance software developer
> home: me at pvanhoof dot be 
> gnome: pvanhoof at gnome dot org 
> http://pvanhoof.be/blog
> http://codeminded.be
-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be
Attachment: mytest4.tar.gz
Description: application/compressed-tar
References:
- Peeping at the future
  - From: Philip Van Hoof