Re: Last version of the new summary storage engine



Dirk-Jan Binnema nokia com wrote:
>> The mantra "using less memory results in a faster application" is very true in our case.

> Sure -- my point was really that for any optimization
> we do, we can think of some reason why it would *really*
> make things a *lot* faster. But experience shows that
> it's very easy to fool ourselves. In other words, while
> the discussion *sounds* reasonable, I'm not going to
> believe it before I see it :)
Agree.

> So you're only taking full strings (i.e., the whole From: address
> or Subject:) and no substrings.
Indeed. I abandoned the substring idea, as that would indeed take too much time to calculate. Perhaps it's something for the future, in case even this isn't memory-optimized enough. But really, I don't think we'll have users wanting 500,000 summary items on a phone before phones ship with a gigabyte of RAM by default anyway.

Although you will most certainly have Ubuntu bugzilla maintainers claiming they easily need a few times 100,000 items to show their typical IMAP folder on their phone ;-). I shouldn't say this, but still ... I don't think those people can be categorized as "the common user with a common use case".

In reality, the vast majority of people have folders of 1,000 to 10,000 items. We must indeed optimize for those numbers (and not for the rare 100,000-item cases).

> Still, using hashtables adds some time. Suppose we have 1000 emails.
> 1) It will take extra time because of the hash-table.
> 2) It might save some time because it uses less memory, cache etc.
> Depending on the number of duplicates, 1) or 2) will be dominant.
> If you don't have many duplicate strings, it's likely to only
> slow things down. The question is -- when does it become faster?
This is something to measure. In both tests, however, it was faster. The largest speed gain comes from the fact that this summary store writes in blocks of 1,000 items, rather than rewriting the entire content each time. That's faster because summary content rarely changes: we only write out the new information, while we preserve the existing information.

> Well, this all depends on the number of times a string is
> duplicated. In many cases, there won't be that many. Let's be
> careful not to optimize for some unusual use case, while making
> the normal use case slower.

The normal use case is that summary information rarely changes. A 1,000-item block is only written when 1,000 new items are available. Right now we have to rewrite all of the items no matter what (this is what makes downloading a large folder of 15,000 items or more too slow). Writing only one block and preserving the existing blocks means writing only 1,000 items, instead of all of them.

This makes the normal use case guaranteed faster too. Although I could also adapt the current solution to use fopen (path, "a"), i.e. appending writes. That would be difficult to make atomic, though (right now a rename() is an easy way to make things atomic, while the original file stays mapped until the very last moment, just before the rename() takes place).

Ironically, we also write a smaller mapped file because we have fewer strings to write (we made them unique): writing costs more time than searching for duplicates does. And that was a test on a fast hard disk, not even on an ultra-slow flash disk. Although, indeed, a tablet's memory access will be slower than a PC's memory access too (it's therefore not a guarantee, but we'll test this, of course).

> Flash disks are not that bad actually. And seeking is not as expensive.
I tested this with 100,000 items (with 50% duplicates, which is very much like the Linux mailing list's folder): this was actually faster than writing 50,000 items using the current summary format.

> Could you try with a 'normal' folder? A little cmdline tool to
> show the amount of duplicates could be quite nice...
Common use-case:

Because the "To" will always be identical in INBOX, you will always have ~25% duplicate strings (there are ~4 strings per item to deal with). Add to that your favorite "From" addresses, which will probably account for another ~2% of duplicates, and thread-subject fields, another ~1-2%. Spam will be quite unique compared to legitimate e-mails, as even the "From" field is usually different (the full-name part of it).

Of course.
> Excellent. I hope the optimization works as you think of course,
> I am just being constructively-skeptical :)
Thanks for that.



