RE: [Evolution] bad separation

From: "Ray Lee" <ray madrabbit org>
To: <evolution helixcode com>
Subject: RE: [Evolution] bad separation
Date: Thu, 15 Jun 2000 12:10:38 -0700

Hey there,

I (>>) and Sejal Patel (>) wrote:

>> Just because you're using libxml doesn't mean you're not paying the CPU
>> time to parse the data. Remember, the *common* case will be to load and
>> unload things from disk. There's no way in hell I can afford enough
>> memory to maintain my email in a fully parsed XML tree in memory. And
>> there's also no reason.

> The Principle of Locality greatly reduces the penalty caused by
> this. [...] mainstream CPUs will not suffer nearly as much as you
> think because of the way hardware handles the caching.


Ah, no. First off, locality benefits *linear* data structures. The one thing
you generally *don't* want for in-memory structures is a tree. By their very
nature, the child nodes tend to be non-local from the parents. At some
cut-off point (depending on your cache size), it's better to just leave
data in an unsorted pile rather than putting it through a tree format.

Secondly, I wasn't so much complaining about CPU time as sheer memory. My
main system has 96MB of RAM, and that's a fraction of my mail size, no
matter how you count it. I'll follow up on the CPU time bit below.

>> In these cases the best design decision you can make is to make sure the
>> on-disk format and the in-memory format are the same. That way, you pay
>> no overhead for leaving the data on-disk other than the swap time.

> I agree that it is best to have the in-memory format match the on-disk
> format, but from what I can tell, they already have the structs set up so
> that the in-memory format could resemble an XML on-disk format.


'Resemble' isn't good enough for mmap(). The point I'm trying to make here
is that you started off by saying XML would be more efficient in CPU time.
I'm saying that's only true if you never have to parse the XML data more
than once. For that to happen, my whole mailbox would have to be maintained
in memory, since XML's on-disk format and its in-memory format are
different. Since that's the case, we would be paying the parse time
repeatedly when moving from message to message.
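
To make the mmap() point concrete, here is a minimal sketch in Python (not
Evolution's code). It assumes a hypothetical index of (offset, length) pairs
that was built once and stored separately; read_message and the demo mbox
contents are made up for illustration. Because the bytes on disk are already
the bytes you want in memory, fetching a message is a slice of the mapping
and nothing gets parsed.

```python
import mmap
import os
import tempfile

def read_message(path, offset, length):
    """Return the raw bytes of one message without parsing the rest of the file."""
    with open(path, "rb") as f:
        # Map the whole file read-only; the OS pages in only what is touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[offset:offset + length]

# Demo: a tiny two-message mbox written to a temporary file.
msg1 = b"From ray Thu Jun 15 12:10:38 2000\nSubject: one\n\nfirst body\n\n"
msg2 = b"From sejal Thu Jun 15 12:30:00 2000\nSubject: two\n\nsecond body\n\n"
fd, demo_path = tempfile.mkstemp(suffix=".mbox")
with os.fdopen(fd, "wb") as out:
    out.write(msg1 + msg2)

# Hypothetical index: one (offset, length) pair per message.
index = [(0, len(msg1)), (len(msg1), len(msg2))]
second = read_message(demo_path, *index[1])
os.remove(demo_path)
```

With an XML store the equivalent step would be a parse into a tree every time
the message is brought back in from disk.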

>> Repeat after me: XML is not a magic bullet. Why do you want Yet Another
>> Mail Format? The mbox and mh/Maildir styles, while not perfect, are common.

> XML is not a magic bullet.


(<grin> I promise to never use these powers over you in the service of evil
:-).)

> Quick question though, does your copy of
> Evolution actually maintain the external mbox?


I'm sure what your copy does is what mine would do, were Evolution installed
on my system. I'm waiting for the codebase to settle down a bit before I jump in.
Well, that, and I've got too many other paying projects right now that are
taking my attention.

> I ask because on my copy
> it simply removes everything from my mbox and appends it to its own mbox
> thing.


<shrug> There may be a good reason for this. I would have expected that it
would leave ~/mbox as your default Inbox, and move messages out of it as you
moved them to other folders (assuming Evolution will support multiple
physical folders as well as vFolders, something that I'd recommend strongly
if it isn't already on the drawing boards). As a quick guess, they're doing
this to make sure no one messes with the mbox in a way that invalidates the
index. But for my example, this doesn't matter. Procmail would have been
invoked during delivery, rather than after the message is sitting in the mbox.

> Since it was removing the mbox anyway, I was just curious as to
> why the format they were keeping had to be mbox and not something
> a little better because, as you said, it isn't perfect.


So people can fall back to using Mutt/whatever when they telnet into their
system?

> Well, the question is whether or not it is a half gig of big messages or
> lots of little ones.


The latter.

> Also, I'm quite tempted to do some
> actual speed comparisons between the "\n\nFrom " search, along with the
> special condition checks you occasionally have to do on the messages, and
> libxml, but I'd guess that they would be relatively the same.


Highly doubtful. Assuming both are on disk, in XML form you have to keep
track of context ('Is this a tag inside a message that happens to look like
one of my special outside tags?'). In mbox format, you can do a Boyer-Moore
or KMP search to find the start. It's gonna be lightyears faster.
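
As a rough sketch of the mbox side of that comparison, here is the Horspool
simplification of Boyer-Moore in Python, used to find every "\n\nFrom "
separator. The function names and the sample mailbox are made up for
illustration, and it assumes body lines starting with "From " have already
been escaped on delivery (the usual ">From " quoting), so no context tracking
is needed.

```python
def horspool_find_all(haystack: bytes, needle: bytes):
    """Yield every offset where needle occurs in haystack (Boyer-Moore-Horspool)."""
    m, n = len(needle), len(haystack)
    if m == 0 or n < m:
        return
    # Shift table: for each byte of the needle except the last, the distance
    # from its last occurrence to the end of the needle.
    shift = {byte: m - 1 - i for i, byte in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            yield pos
        # Skip ahead based on the byte aligned with the needle's last position;
        # bytes that never occur in the needle allow a full-length jump.
        pos += shift.get(haystack[pos + m - 1], m)

def message_starts(mbox: bytes):
    """Return the offset of each message's leading 'From ' line."""
    starts = [0] if mbox.startswith(b"From ") else []
    # Each separator match points at the blank line; the message starts 2 bytes later.
    starts += [pos + 2 for pos in horspool_find_all(mbox, b"\n\nFrom ")]
    return starts

sample = (b"From ray Thu Jun 15 12:10:38 2000\nSubject: one\n\nbody one\n\n"
          b"From sejal Thu Jun 15 12:30:00 2000\nSubject: two\n\nbody two\n")
starts = message_starts(sample)
```

The search never has to know where it is inside a message; it just skips
through the bytes. In real Python code the built-in bytes.find would do the
same job; the hand-written loop is only here to show the skipping.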

> Well, I usually delete the messages after I read them so my
> definition of large mailboxes is significantly smaller than yours.


I suppose I could do this for personal mail, but I have to keep work-related
messages. Plus mailing list messages, where a message may not be useful now,
but it may be useful later. That's saved my butt more than once, and with a
text index of my messages (Oh momma!), I'd never leave Linux. Well, except
when I have to reboot into W2k to make money... Or play Asheron's Call...

All things said, I have some empathy for your position. I just don't have
empathy for the new format being XML :-).

Ray
--
rblee impulse net  ~  ray madrabbit org

