RE: Early-posting a new idea for the summary format

From: <Dirk-Jan Binnema nokia com>
To: <spam pvanhoof be>
Cc: tinymail-devel-list gnome org
Subject: RE: Early-posting a new idea for the summary format
Date: Fri, 2 Feb 2007 18:30:50 +0200

Hi Philip, 

>-----Original Message-----
>From: ext Philip Van Hoof [mailto:spam pvanhoof be] 
>Sent: Friday, February 02, 2007 16:09
>To: Binnema Dirk-Jan (Nokia-M/Helsinki)
>Cc: tinymail-devel-list gnome org
>Subject: RE: Early-posting a new idea for the summary format
>
>On Fri, 2007-02-02 at 13:55 +0200, Dirk-Jan Binnema nokia com wrote:
>
>> Mhwww... I am not so convinced about the duplicate string argument; 
>> say, in a mail folder I have 100 mails (and that's a lot!) with the 
>> subject "This is my subject"; by just storing it once, I will save 
>> 99x18 = 1782 bytes. That is not a lot, and of course it adds some 
>> complexity, and makes the summary files quite fragile.
>> 
>> Also note that embedded systems (like the 770/N800) often use 
>> compressed file systems (jffs2), which make the savings even less.
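
For reference, the estimate above can be reproduced with a few lines of C. This is only a sketch of the arithmetic: `dedup_savings` is a hypothetical helper, and it counts only the string bytes, ignoring the terminating NUL and whatever offset or alias a shared string table would need per message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes saved when `copies` identical strings are stored once instead of
 * `copies` times. Only the string bytes are counted (assumption: no NUL
 * terminator and no per-message offset into a shared table). */
static size_t dedup_savings(const char *s, size_t copies)
{
    assert(s != NULL);
    if (copies == 0)
        return 0;
    return (copies - 1) * strlen(s);
}
```

Calling `dedup_savings("This is my subject", 100)` multiplies the 18-byte subject by the 99 copies that would no longer be stored.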
>
>That's true however. The real purpose is to have less memory 
>space being used. By that I mean that strings that reoccur 
>will be more likely to be swapped-in than strings that aren't. 
>The idea is to have an intelligent store that puts the aliases 
>that are used most in the beginning of the file.
>
>That way the block that has the most used strings will be 
>always swapped-in. Whereas the least-used strings will be 
>swapped-in on demand.
>
>It's therefore not just a memory improvement, but also a 
>performance one. And decreasing the amount of needed reads and 
>swap-ins.

Well, so far it sounds more like a 'maybe'; it takes all kinds
of assumptions, including page sizes, frequency of strings, and
how often you need them... these complex schemes quite
often turn out not to work as expected, or screw up certain
use cases.

>In other words .. sorting AND searching would be a lot faster. 
>Because the strings that are used most, will be in real-ram, 
>rather than only in the mmap (which might mean that it's still 
>only available on the jffs2 file).

I can imagine that for sorting, you would need *all* the strings,
or at least their collation keys. And search operations are 
probably about strings that are not used very often, so are
likely to be in the mmap file then, not in ram.

So call me a skeptic :) but hey, you can always try it and show
the numbers; and of course I hope I am wrong.

There might be some easier gains - if you want to improve sorting
of string data, you could store a hash key for each string; that
should speed up sorting quite a bit (and in that case, you
wouldn't need the original string in core for sorting).
Then there are collation keys, if you want to take
case-insensitive search into account.
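
To make the precomputed-key idea a bit more concrete, here is a minimal sketch in plain C. It uses the standard `strxfrm()` to build a collation key once per string, so that sorting only compares keys with `strcmp()`. Note that a plain hash would help with equality tests rather than ordering, which is why the sketch uses a collation key. The `sort_entry` struct and the function names are made up for illustration; nothing here is taken from tinymail or camel.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: the original string plus a precomputed sort key. */
struct sort_entry {
    const char *text;  /* original string, not touched while sorting */
    char *key;         /* strxfrm() output, owned by the entry */
};

/* Build the sort key once. Returns 0 on success, -1 if malloc fails. */
static int sort_entry_init(struct sort_entry *e, const char *text)
{
    assert(e != NULL && text != NULL);
    size_t n = strxfrm(NULL, text, 0) + 1;  /* key length plus NUL */
    e->key = malloc(n);
    if (e->key == NULL)
        return -1;
    strxfrm(e->key, text, n);
    e->text = text;
    return 0;
}

/* qsort() comparator: compares the precomputed keys only. */
static int sort_entry_cmp(const void *a, const void *b)
{
    const struct sort_entry *x = a, *y = b;
    return strcmp(x->key, y->key);
}

static void sort_entries(struct sort_entry *v, size_t n)
{
    qsort(v, n, sizeof *v, sort_entry_cmp);
}
```

Without a call to `setlocale(LC_COLLATE, "")` the keys follow the "C" locale, i.e. plain byte order. In a GLib-based code base, `g_utf8_collate_key()` would probably be the more natural way to build such keys.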

Also, I think there might be lower-hanging fruit right now, maybe
using g_slice here and there; but I would like to see some
profiling data first. And don't worry, I will put tinymail through
a lot of profiling :-)

>> Flash wearing should not be such a big problem, I think. But taking 
>> out those flags might be interesting. Dunno.
>
>Yes, the flags is something that I will nonetheless take out of it.
>Because right now changing a single flag can cause a full 
>rewrite of the entire mmaped file. That's just plain stupid in 
>terms of time-to-write it and in terms of level wearing too.
>
>So that too would be a major performance improvement, 
>especially when storing or changing a lot of flags at the 
>same time. Though I think those "changes" are cached until the 
>very last moment.

Yup. I don't think it would be really noticeable to the user.

>It's also about investigating things like this. A recent 
>discussion with you about setting the flags made me realise 
>that I didn't comprehend that part well enough myself. So I 
>started digging and discovered some interesting potential 
>performance improvements.
>
>Storing the flags in a different mmap, will fix all of those.

There's always hope :)
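
As a rough illustration of why a separate flags store avoids the full rewrite: if every message gets one fixed-size flags word, changing a flag touches only that word. The sketch below uses stdio on a plain file instead of an mmap, and it assumes a layout of one `uint32_t` per message in host byte order; both the layout and the function names are assumptions for illustration, not the actual summary format.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed layout: the flags word of message `index` lives at byte offset
 * index * sizeof(uint32_t), in host byte order (for brevity only). */

/* Overwrite the flags of one message in place. Returns 0 on success. */
static int flags_store(FILE *f, long index, uint32_t flags)
{
    assert(f != NULL && index >= 0);
    if (fseek(f, index * (long)sizeof flags, SEEK_SET) != 0)
        return -1;
    if (fwrite(&flags, sizeof flags, 1, f) != 1)
        return -1;
    return fflush(f) == 0 ? 0 : -1;
}

/* Read the flags of one message. Returns 0 on success. */
static int flags_load(FILE *f, long index, uint32_t *flags)
{
    assert(f != NULL && flags != NULL && index >= 0);
    if (fseek(f, index * (long)sizeof *flags, SEEK_SET) != 0)
        return -1;
    return fread(flags, sizeof *flags, 1, f) == 1 ? 0 : -1;
}
```

Updating message 1 with `flags_store(f, 1, new_flags)` writes four bytes and leaves the strings and every other message's flags untouched.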

>> >That is what it means, yes. Though the summary mmaps could be 
>> >converted between different architectures, but not without a tool.
>> 
>> I guess it would be better to use the 32-bit 'words' on 64 bit 
>> platforms as well. Otherwise, people who share their homedir between 
>> different platforms will get screwed.
>
>Well I don't really see a reason for doing that on a mobile 
>device. But, well, I can of course put the word-length and the 
>"endianism" in the filename of the summary files. This solves it too.


>It basically means that on another architecture, another file 
>will be created.
>
>At this moment this ain't a problem because network byte order 
>is always used for storing integers.
>
>Which can also be a solution for this ... a slightly less 
>efficient one, but one.

Well, there is no problem for the mobile use case; but tinymail
will suck for desktop users who use different archs with
the same ~/.tinymail. Having multiple summaries sucks too, of
course.

>Anyway, I posted the early-idea early for this type of 
>reactions. So please keep 'em going. I know that the summary 
>format can definitely be improved. It's not for today, as what 
>we have today can work too.
>
>But sooner or later, I will proceed and improve the summary 
>format drastically. Probably with the assistance of some jffs2 
>or LogFS developer at some conference ;-)

There is always jffs3 of course :)

>Or by reading filesystem code. As I want to make it that good, 
>that it melts perfectly with the filesystem code. By that I 
>mean, using the right cache size (or, thus, fwrite instead of 
>write) so that entire blocks are always written (instead of 
>bytes). Things like that ..

Well... maybe then it's better to read database code... O_DIRECT
anyone ;)

>I really don't want tinymail to be a burden for the flash card.
>Destroying hardware through excessive writing on a device that 
>has to deal with level-wearing, like flash, is not a nice 
>thing to do ;-)
>
>It'll be considered a benefit over other E-mail solutions for mobiles.

Well, I don't think the flash wearout is such a big risk. It 
takes _a lot_ of writes. 

>In other words: compete by being the best. Not just good.

:-) 

Best wishes,
Dirk.

