Re: [Evolution] bad separation

From: Michael Poole <poole troilus org>
To: Sejal Patel <sejal iname com>
Cc: evolution helixcode com
Subject: Re: [Evolution] bad separation
Date: 14 Jun 2000 17:54:56 -0400

Sejal Patel <sejal iname com> writes:

On 14 Jun 2000, Michael Poole wrote:

[snip]

Because what are broken up into a "simplier formating style"?  You
don't need to keep a flat representation of the mbox file in memory
even if you use the (standard!) mbox format on disk.


So why is it so important (other than direct compatibility without
exporting) to have a flat representation as an mbox?  This whole idea
started off as a way of making a more maintainable and manageable system
than the mbox.


It's exactly for that reason that it would be nice to stay with mbox.
Between ease of handling (not just in code, but also from the command
line or a plain text editor) and interoperation, any new scheme would
have to have tremendous advantages to convince me it would be a good
idea to switch to something else.

 Also, just because it is standard and has been used for
several years (something like 22 years I think) does not mean it is the
absolute best way of doing things.  It just means that it is an accepted
way of doing things (I know I've set myself up here for a huge bashing but
I didn't mean it in a bad way).


It does, however, mean that nothing better has come along and supplanted
it in that time.  mbox has a huge installed base, and it would be very
useful to be able to directly interoperate with that base.

This goes back to the indexing I mentioned: in this, XML doesn't give
you anything that having an index per header field does not give you.
In fact, if you rely on the XML parsers' search functions for this,
you will miss valid aliases.  For example, mail sent to
"john(likes)@(to)annoy . you" would go to the same place as mail sent
to "Johnny Annoyance <john annoy you>", and relying on an XML search
fails to capture that.


To tell you the truth, I didn't even know that you could have
"john(likes)@(to)annoy . you" as an address field and it magically resolve
down to "Johnny Annoyance <john annoy you>".  Thus it was a bit difficult
for me to see that this was a problem.  However, I'm still not a 100% sure
of what you're talking about here so there may or may not be a clean way
of doing this in XML.  Would you mind explaining that a bit more.


(side note: I didn't say it would be magically resolved to "Johnny
Annoyance <john annoy you>".  It won't.  Both will resolve to
"john annoy you")

My point is this: Many header fields that you would wish to search on
can have several ways to write the same information.  If you rely on
most XML parsers' search capabilities, you have two options: linear
search on every instance of that header (ie examine every header line
with that name), or do an indexed search on a particular way of
writing that information.  Adding another index on those fields is
then a fair amount of work, and possibly requires new architecture.
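
A minimal sketch of the aliasing problem, assuming a much-simplified
model of RFC 822 addresses (parenthesized comments and an optional
"Name <addr>" wrapper only; quoted strings and backslash escapes are
ignored).  The helper names and the address john@annoy.you are
illustrative assumptions, not anything from Evolution or libxml.  Both
spellings reduce to one key, which is what an index on the header would
need to store:

```python
def strip_comments(text: str) -> str:
    """Remove parenthesized comments, which may nest."""
    out = []
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def canonical_address(field: str) -> str:
    """Reduce one address to a bare addr-spec with a lowercased domain."""
    field = strip_comments(field)
    # "Display Name <addr>" form: keep only the part inside the brackets.
    if "<" in field and ">" in field:
        field = field[field.index("<") + 1 : field.rindex(">")]
    # Simplification: drop all whitespace, e.g. around "@" and ".".
    field = "".join(field.split())
    if "@" not in field:
        return field
    local, _, domain = field.rpartition("@")
    return f"{local}@{domain.lower()}"


# Index message numbers by canonical address rather than by raw text.
index = {}
for msg_no, raw in enumerate([
    "john(likes)@(to)annoy . you",
    "Johnny Annoyance <john@annoy.you>",
]):
    index.setdefault(canonical_address(raw), []).append(msg_no)
```

A plain substring search over the stored header text for either spelling
would find only one of the two messages; the canonical key finds both.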

The general nature of having an XML tree (at least all the good ones I've
seen) is that it is a tree and that in a worst-case scenario you have an
O(log N) search time when "chasing" pointers around everywhere.  Of course
your IMAP server is going to be using mbox over XML right now because the
idea of using XML for mail storage is not exactly old news.  XML was not
even around when IMAP was.  I would doubt that it has a much higher
performance than a quality internal XML system.


Please explain how you get O(log N) search time for this XML tree
(since it's not a binary search tree, you don't automatically get it).

Rather, my reference to chasing pointers around is this situation:
You have pointer to the root node of the XML mailbox.
You want to search for "From" headers with Joe Smith's name in them.
Your code then iterates over every child of the root node.
For each message node, it finds all the "from" children nodes.
For each "from" node, it looks up the text content, and sees if it
contains Joe Smith's name.
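
The walk described above can be sketched as follows.  This uses Python's
standard xml.etree.ElementTree rather than libxml, and the element names
(mailbox, message, from) are assumptions about what such a store might
look like.  Every message node is visited, so the cost is linear in the
number of messages:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML mail store; the layout is an assumption for illustration.
SAMPLE = """<mailbox>
  <message><from>Joe Smith &lt;joe@example.org&gt;</from><subject>hi</subject></message>
  <message><from>Ann Lee &lt;ann@example.org&gt;</from><subject>re: hi</subject></message>
  <message><from>"Smith, Joe" &lt;joe@example.org&gt;</from><subject>lunch</subject></message>
</mailbox>"""


def find_from(root, needle):
    """Return positions of messages whose <from> text contains needle."""
    hits = []
    # Iterate over every message child of the root node.
    for position, message in enumerate(root.findall("message")):
        # For each message, look at all of its "from" children.
        for node in message.findall("from"):
            if needle in (node.text or ""):
                hits.append(position)
                break
    return hits


root = ET.fromstring(SAMPLE)
matches = find_from(root, "Joe Smith")
```

Only the first message matches here; the third one names the same sender
in a different spelling and is missed by the substring test.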

If libxml provides some more efficient way to do this search, I'd like
to know what it is.  With a normal mbox store, you can search for the
string "\nFrom: " (taking O(mbox_size/7) time even without any helper
data that you might have built when reading it in) and culling from
there.  It's not obvious to me which way is faster, but an inverse
text index (which is what my IMAP server uses) will be much faster
than either of those.
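
A rough sketch of both approaches, assuming a tiny in-memory mbox and a
word-level inverted index over "From:" header values only.  The function
names and the sample data are made up for illustration, and this is not
the IMAP server's actual index.  The raw scan also matches "From: " lines
inside message bodies, which is the culling step mentioned above:

```python
import re
from collections import defaultdict

MBOX = (
    b"From joe@example.org Wed Jun 14 17:00:00 2000\n"
    b"From: Joe Smith <joe@example.org>\n"
    b"Subject: hi\n"
    b"\n"
    b"body one\n"
    b"\n"
    b"From ann@example.org Wed Jun 14 17:05:00 2000\n"
    b"From: Ann Lee <ann@example.org>\n"
    b"Subject: re: hi\n"
    b"\n"
    b"body two\n"
)


def from_headers(data):
    """Yield (offset, value) for each line that starts with 'From: '."""
    marker = b"\nFrom: "
    pos = data.find(marker)
    while pos != -1:
        start = pos + len(marker)
        end = data.find(b"\n", start)
        if end == -1:
            end = len(data)
        # pos points at the newline, so the header line starts at pos + 1.
        yield pos + 1, data[start:end].decode("ascii", "replace")
        pos = data.find(marker, end)


def build_index(data):
    """Map each lowercased word in a From: value to the header offsets."""
    index = defaultdict(set)
    for offset, value in from_headers(data):
        for word in re.findall(r"[a-z0-9]+", value.lower()):
            index[word].add(offset)
    return index


index = build_index(MBOX)
# A query becomes a set intersection instead of a scan of the whole store.
hits = index["joe"] & index["smith"]
```

The index is built once when the store is read; later lookups touch only
the sets for the query words.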

It's not hard to devise an architecture that gives you the same
modularity for searching and processing mbox-stored files.  I don't
think you need much more modularity than per-header and per-MIME-type
operations.


So what you're saying is that the mere idea of using XML is completely
ludicrous even if it would end up making things simpler.  If you keep adding
little bits and pieces to twist one thing to give it the ability to do
another, it doesn't mean that it is better than something that is designed
to be able to handle that in the first place.


I could say just as well that parsing a mail message and storing it in
XML is twisting it so that you can keep adding little bits and pieces
to it, and that the mbox format was designed to be able to handle mail
messages in the first place.

I think that using XML will actually make various code more complex.
I think that using XML will actually make it slower than it could be.
Ray Lee pointed out that using XML will make it significantly harder
to have very large mail stores.

You have to do that lookahead one (1) time with mbox.  With XML you
need to special-case &, <, and > everywhere they occur in your code
for reading.  Random access (say, by a message number in the store)
is also harder if you're using an XML tree to represent the file.


No you don't.  What are you talking about?  You don't have to do that
lookahead ever.  Also, random access is not harder at all.  I fear that you
do not quite understand what I am referring to when I suggested the XML
routine.


I fear you are so buzzword-compliant your blinkers will never come off.

The lookahead to find a new message boundary need only be done once
with mbox, when you read it the first time.  The state machine you
need to use when reading an XML file is rather more complicated than
what you need when reading an mbox file.

Regarding random access: In XML, say you want to pull out (at random)
the 214th message in the store.  How do you do that?  If you memoize
the start-of-message locations in an mbox store, then you can very
easily look up the 214th entry in that table.
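
A minimal sketch of that table, assuming the common mbox convention that
a message starts at a line beginning with "From " at the start of the
file or right after a blank line.  The function names and the sample
store are hypothetical.  One pass records the offsets; after that, any
message is a seek and a read:

```python
import io


def message_offsets(fileobj):
    """Scan once and return the byte offset of each 'From ' separator."""
    offsets = []
    pos = 0
    prev_blank = True  # the start of the file counts as a boundary
    for line in fileobj:
        if prev_blank and line.startswith(b"From "):
            offsets.append(pos)
        prev_blank = line in (b"\n", b"\r\n")
        pos += len(line)
    return offsets


def read_message(fileobj, offsets, n):
    """Return the raw bytes of message n (0-based) using the saved table."""
    start = offsets[n]
    fileobj.seek(start)
    if n + 1 < len(offsets):
        return fileobj.read(offsets[n + 1] - start)
    return fileobj.read()  # last message runs to end of file


store = io.BytesIO(
    b"From joe@example.org Wed Jun 14 17:00:00 2000\n"
    b"Subject: one\n\nfirst body\n\n"
    b"From ann@example.org Wed Jun 14 17:05:00 2000\n"
    b"Subject: two\n\nsecond body\n\n"
    b"From joe@example.org Wed Jun 14 17:09:00 2000\n"
    b"Subject: three\n\nthird body\n"
)
offsets = message_offsets(store)
second = read_message(store, offsets, 1)
```

With a real store, the 214th message would be read_message(f, offsets, 213),
with no need to walk the 213 messages before it.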

I never said that you take away Evolution's ability to read mboxes.  Heck,
I'm using mboxes and fetchmail right now and still will regardless of
whatever happens.  The thing that I'm saying is that since Evolution is
already storing all the messages it is receiving, why not store them in an
XML style format, and if they want it would be a simple matter to export
the thing to mbox or .eml or Outlook's format (I'm assuming that Outlook
has its own format but I don't really know for sure).


I've said why XML doesn't appear to be a good idea for a mail store:
 * poor MIME handling (at the very least, all the & and < and >
   increase parse time and storage space)
 * slower and more awkward to support really fast searches
 * more work to interoperate with other clients
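
To illustrate the storage cost in the first bullet, here is a small
sketch using the escape helpers from Python's standard xml.sax.saxutils
module.  The sample body is made up, and the sketch measures only the
character escaping; it says nothing about parse time or about how a real
store would encode MIME parts:

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical message body containing the three characters XML must escape.
body = "if (a < b && b > c) return a & b;"

stored = escape(body)                 # what would be written inside an element
overhead = len(stored) - len(body)    # extra bytes caused by escaping alone

# The reader has to undo the escaping to get the original text back.
restored = unescape(stored)
```

Each "&" grows by four characters and each "<" or ">" by three, so text
that is heavy in these characters (code, HTML mail, quoted addresses)
grows the most.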

I guess I still haven't given you an argument for why it is better but I
also fear that it is because you're not completely understanding what I'm
saying or that I'm completely misunderstanding what you are saying.  You
haven't told me anything that says it is bad and are making lots of
statements like "if you don't do it right ..." and "if you do this to the
mbox ..."  I'm not trying to be an ass here, I'm just trying to figure out
why XML would totally blow as you seem to believe.


Your initial argument for using XML follows:

So I'm curious about why Evolution email is not being stored in an XML style
format instead of the mbox style it seems to be using.  Storing them in XML
could allow for a much improved search time, more versatile search
criteria, easier parsing of separate messages, and could have a lot of
potential with a lot less code than would the current style of mail
storage you all are using.


I think I've explained why XML does not improve search time or make
search criteria more flexible, and why finding message separators is
not hard to do with mbox.  I thought a fair bit about alternate
message file formats when I was designing my IMAP server, and in the
end mbox and mh were the leading contenders -- with mh losing because
of increased message access time and disk space allocation overhead.
mbox makes a lot of sense for a mail store, and if you add extra files
or in-memory state for operations you want to accelerate, you can make
it do a wide variety of things quickly.

Michael

Follow-Ups:
- Re: [Evolution] bad separation
  - From: NotZed

References:
- Re: [Evolution] bad separation
  - From: Sejal Patel
