Re: [Evolution] bad separation

From: Michael Poole <poole troilus org>
To: Sejal Patel <sejal iname com>
Cc: evolution helixcode com
Subject: Re: [Evolution] bad separation
Date: 14 Jun 2000 16:17:10 -0400

Sejal Patel <sejal iname com> writes:

On 14 Jun 2000, Michael Poole wrote:

Can you elaborate on these?  XML really doesn't get you anything over
mbox format besides increased parse time.  XML also has a long track
record of dealing very poorly with MIME (in terms of how much data you
need to read and escape or un-escape, and check for literals).  In
particular, I think you're mistaken for these reasons:


An increased parse time is without a doubt going to be there just by the
general nature of building an XML tree.  One of the major things I was
thinking about is having an XML structure to separate out messages,
separate the headers and body in the messages, separate out the field
types in the headers, and perhaps even separate the signature out of the
body in the body type (I've been playing around with a rough draft of
doing this just to see the feasibility and found it really
powerful).


It's really powerful until you separate too much and treat part of
somebody's message as a signature..

Now, because these are broken up into a simpler formatting
style (I've also been playing with making everything XML including
contacts and calendar and todo list) it was a lot easier to code up
interaction between all 3 things.


Because what are broken up into a "simpler formatting style"?  You
don't need to keep a flat representation of the mbox file in memory
even if you use the (standard!) mbox format on disk.

Searching can be reduced simply because
you don't have to do long complicated instructions to parse out the mbox
in comparison to a well-designed XML thing because in normal searches and
filters you are looking ONLY for subjects or from or date or things like
that.  They are a breeze to do because the XML has already broken it down
into these and the XML parsers out there are very efficient at finding
these.


This goes back to the indexing I mentioned: in this, XML doesn't give
you anything that having an index per header field does not give you.
In fact, if you rely on the XML parsers' search functions for this,
you will miss valid aliases.  For example, mail sent to
"john(likes)@(to)annoy . you" would go to the same place as mail sent
to "Johnny Annoyance <john annoy you>", and relying on an XML search
fails to capture that.
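
As a rough illustration of the alias point above, here is a minimal
Python sketch.  The two address strings are hypothetical de-obfuscated
forms of the examples (the archive has stripped the "@" and dots), and
normalized_address is a made-up helper name.  It relies on the standard
library's email.utils.parseaddr to drop comments and display names
before comparing, which a plain string search over the stored header
text would not do.

```python
from email.utils import parseaddr


def normalized_address(header_value):
    """Return the bare address with comments and display name removed."""
    _display_name, address = parseaddr(header_value)
    return address


# Assumed spellings of the two examples; the real addresses are not
# recoverable from the archived text.
with_comments = "john(likes)@(to)annoy.you"
with_display_name = "Johnny Annoyance <john@annoy.you>"

# A plain substring search does not find the address in the commented form.
naive_hit = "john@annoy.you" in with_comments

# Comparing the parsed addresses treats both spellings as one mailbox.
same_mailbox = (
    normalized_address(with_comments) == normalized_address(with_display_name)
)
```

An index keyed on the parsed address, rather than on the raw header
text, would let both spellings match the same entry.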

I realise this and I didn't mean to imply that the disk format mattered
but the compatibility between things and the fact that it is much easier
to test and debug XML parsing plus they are already done
efficiently.  This is actually somewhat faster because it loads up the
stuff into an XML tree and can be traversed fairly quickly.  Even
searching for a string in the body would be easier because you would not
have to go through the headers each time looking for the body and stuff
before searching through the body.  Bad explanation I know but I'm hoping
you understand the concept behind it.


No matter how fast your XML parser or traversal is, a full search will
lose by having to chase pointers around your XML tree.  In addition,
just because the disk format is one way does not mean the in-memory
format must be the same.  I've written an IMAP server that does
indexing and searches on text and headers.  Its mail storage format is
mbox (plus extra index files).  I'd be willing to bet it has much
higher performance for searches than something that uses XML
internally.
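
To make the "mbox on disk, index in memory" idea concrete, here is a
small Python sketch using the standard mailbox module.  It is not the
IMAP server described above; build_header_index, the sample messages
and the choice of indexed fields are all assumptions made for
illustration.  It reads a plain mbox file once and builds one lookup
table per header field.

```python
import mailbox
import os
import tempfile
from collections import defaultdict

# Hypothetical two-message mbox used only for this demonstration.
SAMPLE_MBOX = (
    "From alice@example.org Wed Jun 14 16:00:00 2000\n"
    "From: alice@example.org\n"
    "Subject: lunch\n"
    "\n"
    "Noon?\n"
    "\n"
    "From bob@example.org Wed Jun 14 16:05:00 2000\n"
    "From: bob@example.org\n"
    "Subject: Re: lunch\n"
    "\n"
    "Sure.\n"
)


def build_header_index(path, fields=("from", "subject")):
    """Map field -> lowercased header value -> list of message keys."""
    index = {field: defaultdict(list) for field in fields}
    box = mailbox.mbox(path, create=False)
    try:
        for key, message in box.iteritems():
            for field in fields:
                for value in message.get_all(field, []):
                    index[field][value.strip().lower()].append(key)
    finally:
        box.close()
    return index


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "inbox")
    with open(path, "w", newline="\n") as handle:
        handle.write(SAMPLE_MBOX)
    header_index = build_header_index(path)

# Header lookups are now dictionary accesses, not scans of the mbox file.
from_bob = header_index["from"]["bob@example.org"]
```

A real store would persist such tables in separate index files next to
the mbox rather than rebuilding them on every start.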

I'm thinking that using XML will allow you to easily expand on the
things that you can do and allow you to modularize the code much more.


It's not hard to devise an architecture that gives you the same
modularity for searching and processing mbox-stored files.  I don't
think you need much more modularity than per-header and per-MIME-type
operations.
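
One possible shape for per-MIME-type operations, sketched in Python
with the standard email package.  The registry, the handles decorator
and the word-counting handler are all hypothetical names chosen for
this example; the point is only that handlers can be plugged in per
content type regardless of how the message is stored on disk.

```python
from email import message_from_string

# Registry of handlers keyed by MIME content type (hypothetical design).
HANDLERS = {}


def handles(content_type):
    """Decorator that registers a function for one content type."""
    def register(func):
        HANDLERS[content_type] = func
        return func
    return register


@handles("text/plain")
def count_words(part):
    """Example operation: count whitespace-separated words in a text part."""
    return len(part.get_payload().split())


def process(message):
    """Run the registered handler, if any, on every part of the message."""
    results = []
    for part in message.walk():
        handler = HANDLERS.get(part.get_content_type())
        if handler is not None:
            results.append((part.get_content_type(), handler(part)))
    return results


raw = (
    "From: a@example.org\n"
    "Subject: hi\n"
    "Content-Type: text/plain\n"
    "\n"
    "XML or mbox today\n"
)
results = process(message_from_string(raw))
```

Per-header operations could use the same registry pattern, keyed by
header name instead of content type.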

But looking for the first blank line after the headers and then figuring
out which blank lines are part of this body and which ones are part of
the next body might not sound that complex but you are WASTING several CPU
instructions to do this which seems inefficient to me since there are
better more efficient ways of doing this such as XML.


You have to do that lookahead one (1) time with mbox.  With XML you
need to special-case &, <, and > everywhere they occur in your code
for reading.  Random access (say, by a message number in the store)
is also harder if you're using an XML tree to represent the file.
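
The "one lookahead" and random-access points can be sketched in Python
as follows.  This is a minimal illustration, not code from any
existing mail store: scan_message_offsets and read_message are made-up
names, and the sketch assumes the writer has already escaped body
lines that begin with "From " (conventionally as ">From "), so a
"From " line after a blank line marks a new message.

```python
import io


def scan_message_offsets(stream):
    """Scan a binary mbox stream once; return each message's byte offset."""
    offsets = []
    previous_blank = True  # the first message has no preceding blank line
    position = stream.tell()
    for line in iter(stream.readline, b""):
        if previous_blank and line.startswith(b"From "):
            offsets.append(position)
        previous_blank = line in (b"\n", b"\r\n")
        position += len(line)
    return offsets


def read_message(stream, offsets, number):
    """Random access: return the raw bytes of message `number` (0-based)."""
    start = offsets[number]
    stream.seek(start)
    if number + 1 < len(offsets):
        return stream.read(offsets[number + 1] - start)
    return stream.read()


# Hypothetical two-message mbox held in memory for the demonstration.
sample = io.BytesIO(
    b"From alice@example.org Wed Jun 14 16:00:00 2000\n"
    b"Subject: one\n"
    b"\n"
    b"first body\n"
    b"\n"
    b"From bob@example.org Wed Jun 14 16:05:00 2000\n"
    b"Subject: two\n"
    b"\n"
    b"second body\n"
)
offsets = scan_message_offsets(sample)
second = read_message(sample, offsets, 1)
```

Once the offset table exists, fetching message N is a seek plus a
read, with no further parsing of the messages before it.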

I wasn't talking about writing your own XML parsing.  In fact, I'm using
libxml to do the random XML parsing stuff that I am doing.  If I'm not
mistaken, evolution is already using libxml in their code.  This is a
thought that I've been working on and I find that it is especially
advantageous when you have large mailboxes (like 100+ messages) and is also
very useful for contacts and calendaring.  It wasn't that much better for
the todos but might as well keep it consistent.


Yes, I understand that Evolution already uses libxml for things like
configuration information.  It would be a gross mistake to think that
this inherently makes it a good way to store messages.

I think it's a fair assumption that any Unix mail reader must be able
to read mbox files (since many, if not most, Unix mail users have that
as their primary mail spool).  In my opinion, unless there are serious
flaws in that format, there's not much reason to switch to another
format.  (And yes, reader/writer conflicts being able to hose an mbox
store is a good reason to use accessories like file locking with
standard mbox.)
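
For what file locking around a standard mbox might look like, here is
a short Python sketch using the standard mailbox module, whose mbox
class provides lock() and unlock().  The append_with_lock helper and
the sample message are assumptions for illustration only, and the
locking mechanisms actually used depend on the platform.

```python
import mailbox
import os
import tempfile


def append_with_lock(path, message):
    """Append one message to an mbox file while holding its lock."""
    box = mailbox.mbox(path)  # creates the file if it does not exist yet
    box.lock()  # raises mailbox.ExternalClashError if the lock is taken
    try:
        box.add(message)
        box.flush()
    finally:
        box.unlock()
        box.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "inbox")
    # Hypothetical message text used only for the demonstration.
    append_with_lock(
        path,
        "From: sender@example.org\nSubject: locking test\n\nhello\n",
    )
    check = mailbox.mbox(path)
    stored_count = len(check)
    check.close()
```

Locking only helps if every reader and writer of the file follows the
same locking convention.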

Michael

