Re: [Evolution-hackers] compression for folders?



I find your arguments amusing.

On one hand, you diss my ideas in general because of one specific case:
that compressing an mbox is too CPU and I/O intensive a process to be a
useful mechanism.  Then you turn around and argue for this very thing
because it is simple.

Let me be clear.

I have experienced the very mbox compression madness with mutt.

It is true that you can append gzip streams one on top of another, and,
as far as I can tell, decompression never knows the difference.

For example:

   echo hi | gzip > foo.gz
   echo ho | gzip >> foo.gz
   echo fun | gzip >> foo.gz
   zcat foo.gz
   hi
   ho
   fun

.. works just fine.
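
The same property can be checked from code.  What follows is only a
minimal sketch, assuming Python's standard gzip module; the helper names
append_member and read_all are made up for illustration and are not
anything Mutt or Evolution provides.

```python
import gzip


def append_member(archive: bytes, text: str) -> bytes:
    """Append one independently compressed gzip member to existing data."""
    return archive + gzip.compress(text.encode("utf-8"))


def read_all(archive: bytes) -> str:
    """Decompress every concatenated member in order, as zcat would."""
    return gzip.decompress(archive).decode("utf-8")


archive = b""
for line in ("hi\n", "ho\n", "fun\n"):
    archive = append_member(archive, line)

print(read_all(archive), end="")
```

Note that appending never touches the bytes already written, but reading
anything near the end still means inflating everything before it.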

Mutt utilizes this ad nauseam.  To save to a folder with a '.gz'
extension, it simply does 'gzip -9 < /tmp/tempmessage >> folder.gz'.  To
read a folder with a '.gz' extension, it decompresses the whole thing to
a temp dir, does a normal mbox open on it, and recompresses the whole
thing before writing it back to the original folder name if you make any
modifications.

Trust me, I know what you object to, so let me say it clearly:  I am not
proposing we compress mboxes and call it a day.

I agree this is an objectionable mechanism, too io and cpu intensive,
etc.

The alternative that would cost nearly zero overhead would be to
compress individual message bodies with some kind of compression
attachment format.  This would keep seeking as cheap as it is with
mboxes and/or Maildir etc., but permit the biggest part of large emails
to be compressed.

Personally, I suggest this only as the `other extreme'.  I find it
distasteful because it doesn't compress everything, and it also pays
the overhead of compression headers on every single message.

I think it goes without saying that less overhead means fewer
compression headers, which in turn means compressing more data together
than just a single email message.

With the above in mind, I thought perhaps grouping email messages at
some threshold would make sense.
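
To give a feel for the tradeoff, here is a rough sketch that compares
compressing every message separately, compressing in groups, and
compressing the whole folder as one stream.  It assumes Python's zlib
module; the synthetic messages, the example.com addresses and the group
size of 50 are all invented for illustration, so the numbers it prints
say nothing about any real mailbox.

```python
import zlib


def make_message(n: int) -> bytes:
    """Build a small synthetic message; the content is purely illustrative."""
    return (
        f"From: user{n}@example.com\n"
        "To: evolution-hackers@example.com\n"
        f"Subject: Re: compression for folders? ({n})\n"
        "\n"
        f"This is the body of message number {n}.\n"
    ).encode("utf-8")


messages = [make_message(n) for n in range(200)]

# Every message compressed on its own: each one pays the stream overhead
# and cannot share redundancy with its neighbours.
individual = sum(len(zlib.compress(m, 9)) for m in messages)

# Messages compressed in groups (the threshold is an arbitrary choice).
group_size = 50
grouped = sum(
    len(zlib.compress(b"".join(messages[i:i + group_size]), 9))
    for i in range(0, len(messages), group_size)
)

# The whole folder as a single stream.
whole = len(zlib.compress(b"".join(messages), 9))

print("individual:", individual, "grouped:", grouped, "whole:", whole)
```

The point of grouping is to get most of the space savings of the single
stream while only having to inflate one group to reach a given message.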

Once you start grouping email messages together, you either apply some
really ugly hacks or you invent a new email folder.

Personally, it makes sense to me to utilize something like the .zip
format, where you can have an index to the different chunks of data, and
can seek to that offset and decompress only what you need to, instead of
the whole folder.

Because you can seek and decompress selectively, I presumed this would
in effect nix any 'massive io overhead and cpu time' vs 'disk io time
with decompressed data' vs 'disk io time with compressed data' style
arguments.
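
A rough sketch of what I mean by seeking and decompressing selectively,
assuming Python's standard zipfile module.  The member naming scheme
(group-0000 and so on), the group size of 3 and the helper names are
hypothetical, chosen only to keep the example short; this is not a
proposed on-disk format.

```python
import io
import zipfile


def build_archive(messages, group_size=3):
    """Store messages in a zip, one deflated member per group of messages."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for start in range(0, len(messages), group_size):
            group = b"".join(messages[start:start + group_size])
            zf.writestr(f"group-{start // group_size:04d}", group)
    return buf.getvalue()


def read_group(archive, group_number):
    """Look the member up in the zip index and inflate only that group."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(f"group-{group_number:04d}")


messages = [
    f"From sender{n}\nSubject: test {n}\n\nbody {n}\n\n".encode("utf-8")
    for n in range(10)
]
archive = build_archive(messages)

# Only the third group is inflated; the other groups are never touched.
print(read_group(archive, 2).decode("utf-8"), end="")
```

A real folder format would also need an index from message to group and
offset within the group, which this sketch leaves out.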

Searching bodies of emails would be a bear, but it is that already.

Because it is easy to 'add' emails to an archive, one could do it in
small groups or one at a time.  This could be sufficient, but
re-compressing email messages into larger groups than 1 email per group
would yield better results.

This is what I meant by a background thread that recompressed email
folders.
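
As a sketch of that idea, again assuming Python's zipfile module: new
mail is appended one message per member (the quick write), and a
separate repack step later rewrites the archive with several messages
per member.  The function names, member names and the group size of 25
are assumptions for illustration only, and the repack here runs inline
rather than in an actual background thread.

```python
import io
import zipfile


def append_message(buf, name, message):
    """Quick write: add one message as its own small deflated member."""
    with zipfile.ZipFile(buf, "a", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, message)


def repack(buf, group_size):
    """Slow path: rewrite the archive with several messages per member."""
    with zipfile.ZipFile(buf) as old:
        messages = [old.read(name) for name in old.namelist()]
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as new:
        for start in range(0, len(messages), group_size):
            group = b"".join(messages[start:start + group_size])
            new.writestr(f"group-{start // group_size:04d}", group)
    return out


folder = io.BytesIO()
originals = []
for n in range(100):
    text = (
        "From: list@example.com\n"
        f"Subject: message {n}\n\nHello again, number {n}.\n"
    )
    originals.append(text.encode("utf-8"))
    append_message(folder, f"msg-{n:04d}", originals[-1])

packed = repack(folder, group_size=25)
print("before:", len(folder.getvalue()), "after:", len(packed.getvalue()))
```

Appends stay cheap because they never rewrite existing members; the
expensive regrouping can wait until the machine is otherwise idle.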

Hopefully these comments will give you some more insight into my
thought processes.

Of course, I expect it to be optional, only enabled if people wish to
compress their email.  Most users don't have to worry about that, but if
I wish to continue carrying all the email I've ever received with me,
this is mandatory in my book.  Well, not mandatory, but I choose that
over having a laptop hard drive loaded with uncompressed (yet highly
compressible) email and nothing else.

On Tue, 2004-05-11 at 15:08, Jeffrey Stedfast wrote:
> On Tue, 2004-05-11 at 10:28 -0500, Todd T. Fries wrote:
> > I wish for everyone involved in this discussion to read my comments on
> > bug 23621.  It's obvious from some of the comments that it has not been
> > read.  I see no point in re-creating it here, as I was told that is the
> > proper medium for appending constructive advice like comments.
> > 
> > That said, please understand:
> > 
> > - compression could be made (in my mind) completely transparent if a new
> > folder that had compression in mind were given life
> 
> depends on how you define transparent ;-)
> 
> accessing mail in a large compressed folder is not going to be nearly as
> fast as accessing mail in a non-compressed folder. You can't seek into a
> gzipped (or bzip2'd) mbox file nearly as fast as you can seek into the
> same uncompressed mbox. This is just math. So it's a proveable fact that
> compressed folders *will* be noticable by users. So if "transparent" is
> defined as "users won't know" then I believe you'd be wrong (well,
> unless they are used to really slow mail access).
> 
> > - compression could be optional (you don't use it until you opt to
> > convert a folder to a compressed format folder)
> 
> compression SHOULD NOT be an option for non-archive folders. period. the
> performance impact alone is enough to say "Hell No".
> 
> > - archiving is separate from compression
> 
> this is where I think you're wrong. Compression only makes sense for
> archived folders. anything else is going to piss more users off than
> anything. I know, I know... "but it's just an option!". But if it's not
> sensible for users to do, why offer it?
> 
> > - yes, it's fast to append to gzip and even bzip2 data streams
> 
> if you do this, you instantly kill read performance since the code will
> have to specifically look for the end of each gzipped substream by
> comparing 8 bytes each byte it reads. Sounds uber fast, don't it?
> 
> > , and yes
> > it takes some cpu time to recrunch them;
> 
> a LOT of cpu. and a lot of I/O too.
> 
> >  for this reason in my proposed
> > new folder type I suggested grouping messages to allow for a fair
> > tradeoff between too big of an mbox as a single gzip stream vs every
> > message compressed individually, both of which have obvious
> > objectionable qualities (time vs space, respectively)
> 
> how do you propose this be done? I just don't see this being cost
> effective. Sounds like a lot of work to come up with an algorithm for
> this, and what's the benefit, really?
> 
> The simplest way is to simply gzip an entire mbox and be done with it.
> And if you keep it to just archive folders like I suggest, then you
> don't have to worry too much about performance penalties. It should be
> rare enough that a user will modify the folder that the performance
> negatives will be acceptable.
> 
> > - one could even allow for a background thread or a manually invoked
> > thread that recompresses things in the background for a tighter fit;
> 
> yea... I'm gonna have to ask you to come in on Sunday... Oh, and it's
> not a half day or anything, so you'll need to be at work at 9am. Yea...
> </office space>
> 
> > access time doesn't suffer,
> 
> how will access time not suffer? (see above for a hint)
> 
> >  quick writes don't suffer
> 
> how will quick writes not suffer? I don't follow.
> 
> > , but recompressing
> > can reclaim more diskspace especially if one opts to allow the
> > recompressing program to attempt multiple algorithms to determine the
> > tightest packing algorithm for a given dataset
> 
> sounds like a cpu chugfest to me.
> 
> > 
> > Hopefully this will make it clear that, in my mind, short of manpower,
> > the concepts of compression could be done in such a way that would not
> > be objectionable to anyone.
> 
> your compression ideas are already pretty objectionable to me :-)
> 
> I think you'll find my approach not only easier to implement, but far
> less resource intensive and "good enough" for most everyone's (those who
> care about archive/compression support) usage scenarios.
> 
> But since you are going to look into hacking this yourself, feel free to
> go with whatever you want. Don't let me stop ya ;-)
> 
> Jeff
> 
> > 
> > On Mon, 2004-05-10 at 22:17, Not Zed wrote:
> > > On Mon, 2004-05-10 at 16:02 -0700, Ray Lee wrote: 
> > > > On Mon, 2004-05-10 at 15:28, Jeffrey Stedfast wrote:
> > > > > you are forgetting the fact that folders are generally not read-only,
> > > > > and so in order to write any new data to the gzip file, you'd have to
> > > > > rewrite it from scratch which negates any speed improvements you could
> > > > > possibly claim.
> > > > 
> > > > ray:~$ echo hello | gzip >test.gz
> > > > ray:~$ echo world | gzip >>test.gz
> > > > ray:~$ zcat test.gz
> > > > hello
> > > > world
> > > > ray:~$
> > > > 
> > > > As long as the archive folders only support appending, there's no need
> > > > to rewrite the entire file. Further, there's no need to even keep it in
> > > > one big file (and many good reasons not to). Partition the archives by
> > > > month, or something.
> > > 
> > > FWIW there is actually a reason to store them in one compressed stream
> > > (vs catting them or separate files).  It will compress a lot better,
> > > one large stream vs many smaller ones, there is a lot more redundant
> > > data to compress.  Particularly considering the typical size of email
> > > messages.
> > > 
> > > > > also, as a curiosity, I actually tested this theory and it doesn't hold
> > > > > true. reading/inflating a gzip file off disk is no faster than reading
> > > > > the non-compressed file off disk, *and* inflating the gzip file pegs the
> > > > > cpu so if the app was doing other things then it would negatively impact
> > > > > performance of those other operations.
> > > > 
> > > > This rather obviously depends on CPU speed versus disk speed, yes? If I
> > > > had a modern CPU with a device that had a transfer speed of 1 byte a
> > > > second, compressing the stream is an obvious win. If I have a device
> > > > with a transfer speed of 1 GB/s, it's an obvious loss.
> > > It also depends on other factors like i/o readahead, async i/o etc.  I
> > > remember doing an async i/o based GIF decoder on an Amiga 500.  It
> > > could decode raw gif at about the speed it could be loaded off floppy
> > > (hmm, 7mhz!), without async i/o it bit, but with async i/o it was much
> > > faster than loading the raw image would have been.  Still, compression
> > > is usually much more expensive.
> > > 
> > > 
> > > Michael Zucchi
> > > <notzed ximian com>
> > > 
> > > Ximian Evolution and
> > > Free Software Developer
> > > 
> > > 
> > > Novell, Inc.
> > -- 
> > Todd Fries .. todd fries net
> > 
> >  _____________________________________________
> > |                                             \  1.636.410.0632 (voice)
> > | Free Daemon Consulting, LLC                 \  1.405.227.9094 (voice)
> > | http://FreeDaemonConsulting.com             \  1.866.792.3418 (FAX)
> > | "..in support of free software solutions."  \  1.700.227.9094 (IAXTEL)
> > |                                             \          250797 (FWD)
> >  \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
> >                                                  
> >               37E7 D3EB 74D0 8D66 A68D  B866 0326 204E 3F42 004A
> >                         http://todd.fries.net/pgp.txt
> > 
> > 
> > 
> > 
> > 
> > _______________________________________________
> > evolution-hackers maillist  -  evolution-hackers lists ximian com
> > http://lists.ximian.com/mailman/listinfo/evolution-hackers
> > 
-- 
Todd Fries .. todd fries net

 _____________________________________________
|                                             \  1.636.410.0632 (voice)
| Free Daemon Consulting, LLC                 \  1.405.227.9094 (voice)
| http://FreeDaemonConsulting.com             \  1.866.792.3418 (FAX)
| "..in support of free software solutions."  \  1.700.227.9094 (IAXTEL)
|                                             \          250797 (FWD)
 \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
                                                 
              37E7 D3EB 74D0 8D66 A68D  B866 0326 204E 3F42 004A
                        http://todd.fries.net/pgp.txt






