Re: Last version of the new summary storage engine



Dirk-Jan Binnema nokia com wrote:
>> The mantra "using less memory results in a faster application" is very true in our case.

> Sure -- my point was really that for any optimization
> we do, we can think of some reason why it would *really*
> make things a *lot* faster. But experience shows that
> it's very easy to fool ourselves. In other words, while
> the discussion *sounds* reasonable, I'm not going to
> believe it before I see it :)
Agree.

> So you're only taking full strings (i.e., the whole From: address
> or Subject:) and no substrings.
Indeed. I abandoned the substring idea, as that would indeed take too much time to calculate. Perhaps it's something for the future, in case even this isn't memory-optimized enough. But really, I don't think we'll have users wanting 500,000 summary items on a phone before phones ship with a gigabyte of RAM by default anyway.

Although you will most certainly have Ubuntu bugzilla maintainers claiming they easily need a few times 100,000 items to show their typical IMAP folder on their phone ;-). I shouldn't say this, but still ... I don't think those people can be categorized as "the common user with a common use case".

In reality, the vast majority of people have folders of 1,000 to 10,000 items. We must indeed optimize for those numbers (and not for the rare 100,000-item cases).

> Still, using hashtables adds some time. Suppose we have 1000 emails.
> 1) It will take extra time because of the hash-table.
> 2) It might save some time because it uses less memory, cache etc.
> Depending on the number of duplicates, 1) or 2) will be dominant.
> If you don't have many duplicate strings, it's likely to only
> slow things down. The question is -- when does it become faster?
This is something to measure. In both tests, however, it was faster. The largest speed gain comes from the fact that this summary store writes in blocks of 1,000 items, rather than rewriting the entire content each time. That's faster because summary content rarely changes: we only write out the new information, while we preserve the existing information.

> Well, this all depends on the number of times a string is
> duplicated. In many cases, there won't be that many. Let's be
> careful not to optimize for some unusual use case, while making
> the normal use case slower.

The normal use case is that summary information rarely changes. A 1,000-item block is only written when 1,000 new items are available. Right now we have to rewrite all of the items no matter what (this is what makes downloading a large folder of 15,000 items or more too slow). Writing only one block and preserving the existing blocks means writing only 1,000 items, instead of all of them.

This makes the normal use case guaranteed faster too. Although I could also adapt the current solution to use fopen (path, "a"), i.e. appending writes. That would be difficult to make atomic, though (right now a rename() is an easy way to make things atomic, while the original file stays mapped until the very last moment, just before the rename() takes place).

Ironically, we also write a smaller mapped file because we have fewer strings to write (we made them unique): writing costs more time than searching for duplicates does. And that was a test on a fast hard disk, not even on an ultra-slow flash disk. Although, indeed, a tablet's memory access will be slower than a PC's memory access too (it's therefore not a guarantee, but we'll test this, of course).

> Flash disks are not that bad actually. And seeking is not as expensive.
I tested this with 100,000 items (with 50% duplicates, which is very much like the Linux mailing list's folder): this was actually faster than writing 50,000 items using the current summary format.

> Could you try with a 'normal' folder? A little cmdline tool to
> show the amount of duplicates could be quite nice...
Common use-case:

Because the "To" will always be identical in INBOX, you will always have ~25% duplicate strings (there are ~4 strings per item to deal with). Add to that your favorite "From" addresses, which will probably account for another ~2% of duplicates, and thread-subject fields, another ~1-2%. Spam will be quite unique compared to legitimate e-mails, as even the "From" field is usually different (the full-name part of it).

Of course.
> Excellent. I hope the optimization works as you think of course,
> I am just being constructively-skeptical :)
Thanks for that.



