RE: Last version of the new summary storage engine



Hi Philip,

> -----Original Message-----
> From: ext Philip Van Hoof [mailto:spam pvanhoof be] 
> 
> Dirk-Jan Binnema nokia com wrote:
> 
> I tried sending a reply, but it seems that failed. If you received
> one, here's another one :)
> 
<snip>

> The mantra "using less memory results in a faster application" is
> very true in our case.

Sure -- my point was really that for any optimization
we do, we can think of some reason why it would *really*
make things a *lot* faster. But experience shows that
it's very easy to fool ourselves. In other words, while
the discussion *sounds* reasonable, I'm not going to
believe it before I see it :)

> >> What is much harder, there's also a (significant?) *cost* to scan
> >> for duplicate strings, to reconstruct them and so on, when
> >> showing a folder.
> With a hashtable-based implementation you don't need to scan for
> duplicates. You just add items using the string as key. Hashtables
> are fast at this because they make a hash of the string key; the
> comparisons are integer comparisons. Once all items are in the
> hashtable, you just walk the table writing out each string and
> updating the index with its location (that's the "affected" list in
> the test).

So you're only taking full strings (i.e., the whole From: address
or Subject:), not substrings.

Still, using hashtables is not free. Suppose we have 1000 emails:
1) It will take extra time because of the hash table.
2) It might save some time because of lower memory use, better
   cache behaviour, etc.
Depending on the number of duplicates, 1) or 2) will dominate.
If you don't have many duplicate strings, it's likely to only
slow things down. The question is -- when does it become faster?
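
Just to check that we mean the same thing, here is roughly what I
picture (a minimal GLib sketch of my reading of the approach, not
your actual code; all names are made up):

  #include <glib.h>
  #include <stdio.h>

  static void
  write_unique (gpointer key, gpointer value, gpointer user_data)
  {
    /* The real code would also record the string's offset in the
       index (the "affected" list); here we just print it. */
    printf ("%s\n", (const char *) key);
  }

  static void
  dedup_and_write (const char **strings, guint n)
  {
    /* Insert every string; duplicates collapse onto the same key. */
    GHashTable *seen = g_hash_table_new (g_str_hash, g_str_equal);
    guint i;

    for (i = 0; i < n; i++)
      g_hash_table_insert (seen, (gpointer) strings[i], NULL);

    /* One walk over the table writes each unique string once. */
    g_hash_table_foreach (seen, write_unique, NULL);
    g_hash_table_destroy (seen);
  }

  int
  main (void)
  {
    const char *from[] = { "alice@example.com", "bob@example.com",
                           "alice@example.com", "alice@example.com" };
    dedup_and_write (from, G_N_ELEMENTS (from));
    return 0;
  }

If the real code also copies the strings into the table, that copy is
part of the cost in 1) above.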

> Another reason why this will be a lot faster is because the size of
> each mapped file will be 1000 items. The current mapped file simply
> contains all items. Rewriting a folder with 50,000 items means that
> we have to do 50,000 x 4 string writes. With this summary storage we
> will only rarely have to uniq and sort a file with 1000 x 4 strings,
> a sequential index and a sequential flags file (writing the
> sequential ones is too fast to even mention, really).

Well, this all depends on the number of times a string is
duplicated. In many cases, there won't be many duplicates. Let's be
careful not to optimize for an unusual use case while making
the normal use case slower.
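
Back-of-envelope, just to make the break-even visible (the ~32 bytes
average string length is purely my guess):

  full rewrite, current format : 50,000 x 4 strings x ~32 B      ≈ 6.4 MB
  one 1000-item block, 50% dup :  1,000 x 4 strings x 0.5 x ~32 B ≈ 64 KB

At 0% duplicates the smaller per-block rewrites still help, but the
dedup itself saves nothing and only adds hashing cost.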

> Ironically we also write a smaller mapped file because we have fewer
> strings to write (we made them unique): writing is slower than
> searching for duplicates. And that was a test on a fast harddisk,
> not even on an ultra-slow flash disk. Although, indeed, a tablet's
> memory access will be slower than a PC's memory access too (it's
> therefore not a guarantee, but we'll test this of course).

Flash disks are not that bad actually, and seeking on them is not
as expensive as on a spinning disk.
 
> I tested this with 100,000 items (with 50% duplicates, this is very
> much like the Linux mailing list's folder): this was actually faster
> than writing 50,000 items using the current summary format.

Could you try with a 'normal' folder? A little cmdline tool to
show the number of duplicates could be quite nice...
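
Such a tool is only a handful of lines; something along these lines
would do (a rough GLib sketch that reads one summary string per line
from stdin -- nothing in it comes from the actual engine):

  #include <glib.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* Key-destroy frees our strdup'ed keys; duplicates collapse. */
    GHashTable *seen = g_hash_table_new_full (g_str_hash, g_str_equal,
                                              g_free, NULL);
    char line[1024];
    guint total = 0;

    while (fgets (line, sizeof (line), stdin))
      {
        g_strchomp (line);               /* drop trailing newline */
        g_hash_table_insert (seen, g_strdup (line), NULL);
        total++;
      }

    printf ("%u strings, %u unique, %u duplicates\n",
            total, g_hash_table_size (seen),
            total - g_hash_table_size (seen));

    g_hash_table_destroy (seen);
    return 0;
  }

Feeding it the From: and Subject: strings of a couple of real folders
would tell us quickly whether the 50% figure is anywhere near typical.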

> >> To me at least it's not obvious that this optimization will not
> >> actually slow things down.
>
> That's why I'm making this test. To test that :)
>
> >> So, before rolling this out, I'd recommend getting some numbers
> >> on the difference with/without string clustering (this should
> >> include showing the items).
>
> Of course.

Excellent. I hope the optimization works as you expect, of course;
I am just being constructively skeptical :)

Best wishes,
Dirk.

