Re: [Evolution-hackers] A Camel API to get the filename of the cache, also a proposal to have one format to rule them all

From: Philip Van Hoof <spam pvanhoof be>
To: Jeffrey Stedfast <fejj novell com>
Cc: Evolution Hackers <evolution-hackers gnome org>
Subject: Re: [Evolution-hackers] A Camel API to get the filename of the cache, also a proposal to have one format to rule them all
Date: Mon, 05 Jan 2009 15:05:27 +0100

On Mon, 2009-01-05 at 08:25 -0500, Jeffrey Stedfast wrote:

> migrating away from the IMAP specific data cache would be good.

Yes. I think IMAP and the local providers are the only ones that are
still using a specialized datacache.

The IMAP4 one, for example, ain't using a specialized one.

> >> b) migrate away the mbox data cache (the all-in-one file crap)
> >>     
> > I'm all for it. Once I thought of doing this, but the options were like
> > Maildir or a format of one mbox file per mail in a distributed folder
> > [CamelDataCache sort of format, like imap4/GW/Exchange]. But IIRC Fejj,
> > had some concern like, Local still might be good to be held in a
> > 'standards' way. I know it hurts us on expunge/mailbox rewrite etc.
> >   
> 
> what mbox data cache? CamelDataCache would probably be the best cache to
> use for IMAP.

Although I would change CamelDataCache to store individual MIME parts as
separate files instead of files that look like a single-mail MBox file.

I would also decode the separate MIME parts before storing them if the
original E-mail had them encoded (which is usually the case, and always
for binary attachments). This is to make it easier for metadata engines
to index the MIME parts, and to let them do so efficiently.

Perhaps also to reduce disk usage, as encoded data takes up more space,
but for me that is just a nice side effect.

So my format would create a directory for each E-mail, or prefix each
MIME part with the uid. Perhaps:

INBOX/subfolders/temp/1.              // headers+multipart container
INBOX/subfolders/temp/1.1             // multipart container
INBOX/subfolders/temp/1.1.1           // text/plain
INBOX/subfolders/temp/1.1.2           // text/html
INBOX/subfolders/temp/1.2.1           // inline JPEG attachment
INBOX/subfolders/temp/1.BODYSTRUCTURE // Bodystructure of the E-mail
INBOX/subfolders/temp/1.ENVELOPE      // Top envelope of the E-mail

ps. Perhaps I would store 1.BODYSTRUCTURE in the database instead. I
would probably store 1.ENVELOPE in the database (like how it is now).

On top of storing BODYSTRUCTURE and ENVELOPE in the database, I would
probably also store them in separate files, even though most filesystems
will consume 4k or more (sector or block size) for each of those mini
files.
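
To make the layout above a bit more concrete, here is a rough Python
sketch (not Camel code) of splitting one message into decoded per-part
files. The name store_parts, the IMAP-style section numbering and the
decision to skip the container, BODYSTRUCTURE and ENVELOPE files are all
assumptions for illustration. With this numbering the JPEG from the
listing would land at 1.2 rather than 1.2.1.

```python
import email
from email import policy
from pathlib import Path


def store_parts(raw, uid, folder):
    """Write every leaf MIME part of `raw` to `folder` as <uid>.<section>.

    Hypothetical helper: sections are numbered IMAP-style (1, 1.1, 2, ...)
    and each part is stored with its transfer encoding already removed.
    Returns the list of file names written, in message order.
    """
    msg = email.message_from_bytes(raw, policy=policy.default)
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    written = []

    def walk(part, section):
        if part.is_multipart():
            # Containers only contribute to the numbering of their children.
            for index, child in enumerate(part.iter_parts(), start=1):
                walk(child, f"{section}.{index}" if section else str(index))
        else:
            # A message without any multipart container is section "1".
            name = f"{uid}.{section or '1'}"
            # decode=True undoes base64 / quoted-printable.
            (folder / name).write_bytes(part.get_payload(decode=True) or b"")
            written.append(name)

    walk(msg, "")
    return written
```

An indexer can then open, say, <uid>.2 directly as an image file without
knowing anything about MIME.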

To get the JPEG attachment:

$ cp INBOX/subfolders/temp/1.2.1 ~/mommy.jpeg

$ exif INBOX/subfolders/temp/1.2.1
EXIF tags in 'INBOX/subfolders/temp/1.2.1' ('Intel' byte order):
--------------------+----------------------------------
Tag                 |Value                                                     
--------------------+----------------------------------
Image Description   |Mommy with cake at birthday 
Manufacturer        |SONY                                                      
Model               |DSC-T33                                                   
...

$ tracker-search -s EMails birthday
Results:
  email://user@server/INBOX/temp/1
  email://user@server/INBOX/temp/1#2.1
  ~/mommy.jpeg

[CUT]

> this can cause problems if you need to verify signed parts because
> re-encoding them might not result in the same output.

Ok, for signatures I guess we can make an exception and keep them
encoded in their original format then.
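
A minimal sketch of how that exception could be expressed, in Python
rather than Camel's C. The helper name keep_encoded and the list of
signature content types are assumptions, not an existing API. It also
treats everything directly under multipart/signed as "keep as-is",
because the signature is computed over the exact bytes of the signed
part; a real cache would probably have to keep those raw bytes untouched
rather than re-serialize them.

```python
# Content types that carry a detached signature (illustrative list).
SIGNATURE_TYPES = {
    "application/pgp-signature",
    "application/pkcs7-signature",
    "application/x-pkcs7-signature",
}


def keep_encoded(part, parent=None):
    """Return True if `part` should be cached in its original encoded form.

    Hypothetical policy: signature parts, and any direct child of a
    multipart/signed container, are left alone; everything else may be
    decoded before it is written to the cache.
    """
    if part.get_content_type() in SIGNATURE_TYPES:
        return True
    return parent is not None and parent.get_content_type() == "multipart/signed"
```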

> >> For Maildir I recommend wasting diskspace by storing both the original
> >> Maildir format and in parallel store the attachments separately.
> >>
> >> Maildir ain't accessible by current Evolution's UI, by the way.
> >>
> >> For MBox I recommend TO STOP USING THIS BROKEN FORMAT. It's insane with
> >> today's mailboxes that easily grow to 3 gigabytes in size per user.
> >>     
> > I second your thoughts for MBox stuff. 
> >   
> 
> Eh, I think mbox works fine but I can understand wanting to move to
> Maildir which is also fine :-)

Maildir doesn't store individual MIME parts separately. So Maildir is
equally hard to handle for metadata engines as MBox is. The only
difference is that with MBox we need to seek() to some location.

So Maildir doesn't make it possible for us to let app developers
implement indexing plugins easily, like a typical exif extractor.

We would have to Base64-decode image attachments before extracting EXIF
data, for example, instead of just saying: here's a stream, or here's a
FILE*, go ahead and extract the info you want. (With a stream we could
make it relatively easy to auto-decode Base64, but these extractors are
often still FILE*-based, not stream-based.)
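
To illustrate the extra work, here is a rough Python sketch of what a
file-based extractor forces on us when the message is stored whole, as
in Maildir. The function name extract_from_message and the extractor
callback (anything that takes a file path) are hypothetical.

```python
import email
import os
import tempfile
from email import policy


def extract_from_message(raw, content_type, extractor):
    """Run a path-based `extractor` on the first part of type `content_type`.

    The whole message has to be parsed and the part decoded into a
    temporary file before the extractor can even start.
    Returns the extractor's result, or None if no part matches.
    """
    msg = email.message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() != content_type:
            continue
        with tempfile.TemporaryDirectory() as tmpdir:
            path = os.path.join(tmpdir, "attachment")
            with open(path, "wb") as handle:
                # decode=True removes the Base64 transfer encoding.
                handle.write(part.get_payload(decode=True))
            return extractor(path)
    return None
```

With parts cached in decoded form, all of this collapses to calling the
extractor on the cached file's path.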

IMO there's not really a good reason to keep the attachments stored in
their encoded form. Except the signatures, perhaps, but we don't
really need those in decoded form anyway. So it would be fine to have an
exception for signatures (to keep them stored encoded).

Hmm, maybe someday having the fingerprint information about a person
might be useful to verify the identity of an individual before linking
the person with a contact in our RDF triple store.

-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be
