Re: Last version of the new summary storage engine



Dirk-Jan Binnema nokia com wrote:

I tried sending a reply, but it seems that failed. If you received one, here's another one :)
>  o. Duplicate strings are made unique in the mmap
> Philip, I am a bit skeptical of this claim -- can you back that up
> with some numbers? I mean let's take some *extreme* case, where the
> same 10-byte string is seen 1000x in a mailbox; by having only
> one of them we save about 10k. Not very much compared to the normal
> size, even in this extreme case. At least the mem saving is
> rather easy to calculate :)
The problem that this new summary storage is trying to solve is not the amount of (virtual) memory being used (we call this VmSize). If that were the problem, I wouldn't have done the current summary.mmap format the way I did it (because it wastes a lot of virtual memory).
The real problem is the following:

Assume we have the string "Dirk-Jan C. Binnema <Dirk-Jan Binnema nokia com>" as the From address of 20 messages. I received some of these 20 messages a long time ago and some of them very recently. The strings in the summary are therefore scattered over the mapped file: some near the beginning, some in the middle and some at the end.
Those strings are, however, identical in content. Not in location, but
in content (which is what matters), they are the same string.
The problem with that (imagine that I sort my list on "Dirk-Jan") is
that each and every one of those strings will be needed in real memory
modules. The kernel only knows about pages, and a page is four
kilobytes in size. This means that each page that contains such an
identical string must be paged (swapped) into real memory (we call
this VmRSS).
It would be better if the string was stored only once, close in
location to other frequently required strings. That way fewer pages
would be needed in real memory modules (a smaller VmRSS).
If we cat /proc/$PID/status we see all of the memory, both real memory
and just-mapped, as VmSize: this is the virtual memory's size. This
number is mostly worthless, as it contains the memory needed for the
mappings (the libraries and the files that we mapped ourselves) plus
the stack and the heap allocations (like gslice and malloc). By
contrast, valgrind's massif tool measures only heap and stack
allocations. That is also the most interesting part, as those
allocations have a direct impact on the VmRSS.
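To make the two numbers concrete, here's a minimal sketch (just an
illustration for Linux, not Tinymail code) that prints them for its
own process:

  #include <stdio.h>
  #include <string.h>

  /* Print our own VmSize (all virtual memory, including mappings)
   * and VmRSS (the pages actually resident in real memory). */
  int
  main (void)
  {
      char line[256];
      FILE *f = fopen ("/proc/self/status", "r");

      if (f == NULL)
          return 1;

      while (fgets (line, sizeof (line), f))
          if (strncmp (line, "VmSize:", 7) == 0 ||
              strncmp (line, "VmRSS:", 6) == 0)
              fputs (line, stdout);

      fclose (f);
      return 0;
  }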
For Tinymail, what valgrind reports is not really the most interesting 
part. Although what valgrind reports has a direct impact on VmRSS, my 
summary's mapping has a larger impact on said VmRSS. When I say that 
Tinymail uses 10MB of heap, I don't mention that Tinymail also has a 
mapped file of 15MB that is frequently being thrashed. VmRSS (the 
amount of real memory being used) will therefore probably be around 15 
or maybe 20 MB instead of the 10MB of heap that valgrind reported.
The VmRSS indicates the amount of memory that is actually being 
accessed: the memory that is (therefore) in real memory modules. 
Compared to the VmSize, this is a very important number (the VmSize is 
actually not interesting at all, as it's a number polluted by the 
mappings of the libraries, and because being part of the virtual 
memory doesn't mean that a page is actually going to be frequently 
used).
How do we further reduce VmRSS? The answer is simple: we lay out the 
memory of the mapping efficiently, by putting memory that is going to 
be frequently needed together in the same pages of the mapping. What 
memory will that be?
In my example the string "Dirk-Jan C. Binnema 
<Dirk-Jan Binnema nokia com>" will be needed 20 times. The string 
"Tinne Hannes <tinne hannes gmail com>" will probably be needed even 
more often in my personal situation. Sergio, however, will have about 
as many e-mails from you as I have, but probably only one from Tinne. 
We conclude that in both cases your From string is important, but 
Tinne's From string will only be important for me.
It's therefore quite likely that both your string and Tinne's will be 
needed in real memory modules on my tablet PC, whereas on Sergio's 
tablet PC only yours will be needed and Tinne's will perhaps never be 
paged in at all.
That's why I sort strings by their frequency of usage in the mapping. 
That way the frequently needed strings all end up in the same pages, 
and the less frequently needed strings are likewise grouped together 
in theirs.
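As a sketch of the idea (hypothetical helper names, GLib types because
that's what Tinymail uses; this is not the actual summary writer):
count how often each string occurs, then write the unique strings
ordered by that count, so the hottest ones share the first pages:

  #include <glib.h>
  #include <stdio.h>
  #include <string.h>

  /* Order the unique strings by how often they occur, most frequent
   * first, so the hot ones end up together in the first pages of the
   * mapped file. */
  static gint
  cmp_by_count (gconstpointer a, gconstpointer b, gpointer counts)
  {
      gint ca = GPOINTER_TO_INT (g_hash_table_lookup ((GHashTable *) counts, a));
      gint cb = GPOINTER_TO_INT (g_hash_table_lookup ((GHashTable *) counts, b));
      return cb - ca; /* descending */
  }

  static void
  write_clustered (const gchar **strings, gsize n, FILE *out)
  {
      GHashTable *counts = g_hash_table_new (g_str_hash, g_str_equal);
      GList *unique = NULL, *iter;
      gsize i;

      /* One hashtable insert per item: duplicates are detected by the
       * string hash (integer comparisons), not by scanning. */
      for (i = 0; i < n; i++) {
          gint c = GPOINTER_TO_INT (g_hash_table_lookup (counts, strings[i]));
          if (c == 0)
              unique = g_list_prepend (unique, (gpointer) strings[i]);
          g_hash_table_insert (counts, (gpointer) strings[i],
                               GINT_TO_POINTER (c + 1));
      }

      unique = g_list_sort_with_data (unique, cmp_by_count, counts);

      /* Each unique string written once, NUL-terminated, hottest first. */
      for (iter = unique; iter; iter = iter->next) {
          const gchar *s = iter->data;
          fwrite (s, 1, strlen (s) + 1, out);
      }

      g_list_free (unique);
      g_hash_table_destroy (counts);
  }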
On speed: paging from the mapped file (part of VmSize) into real 
memory (VmRSS) is a slow operation. The fewer times we have to do 
this, the faster our application will be, especially when doing things 
like searching and sorting.
The mantra "using less memory results in a faster application" is very 
true in our case.
> What is much harder, there's also a (significant?) *cost* to scan for
> duplicate strings, to reconstruct them and so on, when showing a folder.
With a hashtable-based implementation you don't need to scan for duplicates. You just add items using the string as key. Hashtables are fast at this because they make a hash of the string key, so duplicate detection comes down to integer comparisons. Once all items are in the hashtable, you just walk the table writing out each string, updating the index with the location (that's the "affected" list in the test).
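A rough sketch of that approach (illustrative names only; the real
code keeps a proper index, this just shows the hashtable doing the
dedup work):

  #include <glib.h>
  #include <stdio.h>
  #include <string.h>

  /* The first time a string is seen it is appended to the file and
   * its offset remembered; every later occurrence of the same string
   * reuses the recorded offset in the index. No scanning needed. */
  static glong
  intern_string (GHashTable *offsets, const gchar *str, FILE *out)
  {
      gpointer val;
      glong off;

      /* lookup_extended, because a recorded offset may be zero */
      if (g_hash_table_lookup_extended (offsets, str, NULL, &val))
          return (glong) GPOINTER_TO_INT (val);

      off = ftell (out);
      fwrite (str, 1, strlen (str) + 1, out); /* NUL-terminated */
      g_hash_table_insert (offsets, (gpointer) str,
                           GINT_TO_POINTER ((gint) off));
      return off;
  }

(The table would be created with g_hash_table_new (g_str_hash,
g_str_equal), so keys compare by content, not by pointer.)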
Another reason why this will be a lot faster is that each mapped file 
will contain only 1000 items, whereas the current mapped file simply 
contains all items. Rewriting a folder with 50,000 items means that we 
have to do 50,000 x 4 string writes. With this summary storage we will 
only rarely have to uniq and sort a file with 1000 x 4 strings, plus a 
sequential index and a sequential flags file (writing the sequential 
ones is too fast to even mention, really).
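Put as numbers: 50,000 items x 4 strings = 200,000 string writes 
today, against 1,000 x 4 = 4,000 for the one affected block; roughly a 
50x reduction whenever a change touches only a single block.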
Ironically we also write a smaller mapped file because we have fewer 
strings to write (we made them unique): writing is slower than 
searching for duplicates. And that was a test on a fast harddisk, not 
even on an ultra slow flashdisk. Although, indeed, a tablet's memory 
access will be slower than a PC's memory access too (so it's not a 
guarantee, but we'll test this of course).
I tested this with 100,000 items (with 50% duplicates, which is very 
much like the Linux mailing list's folder): this was actually faster 
than writing 50,000 items using the current summary format.
> To me at least it's not obvious that this optimization will not actually
> slow things down.
That's why I'm making this test. To test that :)

> So, before rolling this out, I'd recommend you to do some numbers of the
> difference with/without string clustering (this should include showing
> the items).
Of course.

Thanks for being concerned. I hope my long reply clarified certain things.

I fear I only added confusion, though. The important thing is the difference between VmSize and VmRSS.


