Re: [Evolution-hackers] compression for folders?



On Tue, 2004-05-11 at 10:28 -0500, Todd T. Fries wrote:
> I wish for everyone involved in this discussion to read my comments on
> bug 23621.  It's obvious from some of the comments that it has not been
> read.  I see no point in re-creating it here, as I was told that is the
> proper medium for appending constructive advice like comments.
> 
> That said, please understand:
> 
> - compression could be made (in my mind) completely transparent if a new
> folder that had compression in mind were given life

Depends on how you define "transparent" ;-)

Accessing mail in a large compressed folder is not going to be nearly
as fast as accessing mail in a non-compressed folder. You can't seek
into a gzipped (or bzip2'd) mbox file nearly as fast as you can seek
into the same uncompressed mbox: the compressed stream is sequential,
so reaching byte N means inflating every byte before it. This is just
math, so it's a provable fact that compressed folders *will* be
noticeable to users. If "transparent" is defined as "users won't know",
then I believe you'd be wrong (well, unless they're used to really slow
mail access).
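
To make the point concrete, here's a rough micro-benchmark sketch using
zlib (file names and offsets are made up; this is just an illustration,
not anything camel does). gzseek() on a read stream can't jump anywhere:
it inflates and throws away every byte up to the target, so the cost
grows linearly with the offset, whereas lseek() on the raw mbox is
constant time:

/* seekcost.c - illustrative only: time a "seek" into a gzipped mbox.
 * build: cc seekcost.c -lz -o seekcost */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

int main (int argc, char **argv)
{
	gzFile gz;
	clock_t start;
	char buf[80];
	long offset;
	
	if (argc != 3) {
		fprintf (stderr, "usage: %s <mbox.gz> <offset>\n", argv[0]);
		return 1;
	}
	
	offset = atol (argv[2]);
	if (!(gz = gzopen (argv[1], "rb")))
		return 1;
	
	start = clock ();
	
	/* for read streams, gzseek() has to inflate (and discard)
	 * everything in front of `offset' - there is no shortcut */
	if (gzseek (gz, offset, SEEK_SET) < 0)
		return 1;
	gzread (gz, buf, sizeof (buf));
	
	printf ("reaching offset %ld took %.3fs\n", offset,
	        (double) (clock () - start) / CLOCKS_PER_SEC);
	gzclose (gz);
	
	return 0;
}

Try it with an offset near the end of a few-hundred-megabyte archive
and compare against lseek() on the uncompressed file.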

> - compression could be optional (you don't use it until you opt to
> convert a folder to a compressed format folder)

Compression SHOULD NOT be an option for non-archive folders, period.
The performance impact alone is enough to say "Hell No".

> - archiving is separate from compression

This is where I think you're wrong. Compression only makes sense for
archive folders; anything else is going to do more to piss users off
than to help them. I know, I know... "but it's just an option!" But if
it's not sensible for users to do, why offer it?

> - yes, it's fast to append to gzip and even bzip2 data streams

If you do this, you instantly kill read performance, since the code
will have to specifically look for the end of each gzipped substream,
comparing 8 bytes at every byte it reads. Sounds uber fast, don't it?
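
For the record, zlib will actually hand you each member boundary as a
Z_STREAM_END return code rather than making you byte-scan for the
trailer yourself, but that doesn't change the core problem: you still
have to inflate every substream in front of the one you want. A rough
sketch (illustrative only, not camel code):

/* members.c - count the gzip members in a file of concatenated gzip
 * streams.  note that the only way to find member N is to fully
 * inflate members 0..N-1 first.
 * build: cc members.c -lz -o members */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main (int argc, char **argv)
{
	unsigned char in[4096], out[4096];
	int members = 0, ret;
	z_stream z;
	FILE *fp;
	
	if (argc != 2 || !(fp = fopen (argv[1], "rb")))
		return 1;
	
	memset (&z, 0, sizeof (z));
	/* windowBits of 15 + 16 tells zlib to expect gzip framing */
	if (inflateInit2 (&z, 15 + 16) != Z_OK)
		return 1;
	
	while ((z.avail_in = fread (in, 1, sizeof (in), fp)) > 0) {
		z.next_in = in;
		do {
			z.next_out = out;
			z.avail_out = sizeof (out);
			ret = inflate (&z, Z_NO_FLUSH);
			if (ret == Z_STREAM_END) {
				/* one member fully decompressed; reset
				 * and keep going - we couldn't have
				 * skipped any of it */
				members++;
				inflateReset (&z);
			} else if (ret != Z_OK && ret != Z_BUF_ERROR) {
				return 1;
			}
		} while (z.avail_in > 0);
	}
	
	printf ("%d members, every byte of every one inflated\n", members);
	inflateEnd (&z);
	fclose (fp);
	
	return 0;
}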

> , and yes
> it takes some cpu time to recrunch them;

A LOT of CPU. And a lot of I/O, too.

>  for this reason in my proposed
> new folder type I suggested grouping messages to allow for a fair
> tradeoff between too big of an mbox as a single gzip stream vs every
> message compressed individually, both of which have obvious
> objectionable qualities (time vs space, respectively)

How do you propose this be done? I just don't see it being
cost-effective. It sounds like a lot of work to come up with an
algorithm for this, and what's the benefit, really?
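
To put it concretely, any grouping scheme ends up needing bookkeeping
along these lines (entirely hypothetical names and layout; nothing like
this exists in camel):

/* blockidx.c - hypothetical sketch of the index a "grouped
 * compression" folder format would have to maintain.
 * build: cc blockidx.c -o blockidx */
#include <stdio.h>
#include <stdint.h>

/* one entry per gzip member ("block") in the folder file */
struct block_entry {
	uint32_t gz_offset;   /* where this member starts in the file */
	uint32_t msg_first;   /* index of the first message inside it */
	uint32_t msg_count;   /* how many messages were packed into it */
};

/* find the block holding message `msg' - to actually read it you then
 * inflate the *entire* block and discard the neighbours */
static const struct block_entry *
find_block (const struct block_entry *idx, int n, uint32_t msg)
{
	int i;
	
	for (i = 0; i < n; i++) {
		if (msg >= idx[i].msg_first &&
		    msg < idx[i].msg_first + idx[i].msg_count)
			return &idx[i];
	}
	
	return NULL;
}

int main (void)
{
	/* toy index: 3 blocks of 50 messages each */
	struct block_entry idx[] = {
		{ 0,      0,   50 },
		{ 61440,  50,  50 },
		{ 122880, 100, 50 },
	};
	const struct block_entry *b = find_block (idx, 3, 75);
	
	if (b != NULL)
		printf ("message 75: inflate all %u messages in the "
		        "block at offset %u just to read one\n",
		        (unsigned) b->msg_count, (unsigned) b->gz_offset);
	
	/* and an expunge means re-packing every block after the victim
	 * and rewriting the index - that's the "algorithm" you'd have
	 * to get right, and tune, for what gain? */
	return 0;
}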

The simplest approach is to gzip the entire mbox and be done with it.
And if you keep it to just archive folders, like I suggest, then you
don't have to worry too much about performance penalties: users should
modify an archive folder rarely enough that paying the full rewrite
cost on each change is acceptable.
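
In zlib terms it's roughly this (a sketch only: error handling trimmed,
every file name made up). Note that the whole archive gets inflated and
re-deflated per append, which is exactly why it only makes sense for
folders that rarely change:

/* archive-append.c - append one message file to a gzipped mbox by
 * rewriting the whole archive.
 * build: cc archive-append.c -lz -o archive-append */
#include <stdio.h>
#include <zlib.h>

static int
archive_append (const char *gzpath, const char *msgpath)
{
	char tmppath[1024], buf[8192];
	gzFile in, out;
	FILE *msg;
	int n, ok = 0;
	
	snprintf (tmppath, sizeof (tmppath), "%s.tmp", gzpath);
	
	if (!(in = gzopen (gzpath, "rb")))
		return -1;
	if (!(out = gzopen (tmppath, "wb"))) {
		gzclose (in);
		return -1;
	}
	
	/* the expensive part: every existing byte gets inflated and
	 * re-deflated on every single append */
	while ((n = gzread (in, buf, sizeof (buf))) > 0)
		gzwrite (out, buf, n);
	
	if ((msg = fopen (msgpath, "rb"))) {
		while ((n = fread (buf, 1, sizeof (buf), msg)) > 0)
			gzwrite (out, buf, n);
		fclose (msg);
		ok = 1;
	}
	
	gzclose (in);
	gzclose (out);
	
	/* atomically replace the old archive */
	return ok ? rename (tmppath, gzpath) : -1;
}

int main (int argc, char **argv)
{
	if (argc != 3)
		return 1;
	
	return archive_append (argv[1], argv[2]) != 0;
}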

> - one could even allow for a background thread or a manually invoked
> thread that recompresses things in the background for a tighter fit;

yea... I'm gonna have to ask you to come in on Sunday... Oh, and it's
not a half day or anything, so you'll need to be at work at 9am. Yea...
</office space>

> access time doesn't suffer,

How will access time not suffer? (See above for a hint.)

>  quick writes don't suffer

How will quick writes not suffer? I don't follow.

> , but recompressing
> can reclaim more diskspace especially if one opts to allow the
> recompressing program to attempt multiple algorithms to determine the
> tightest packing algorithm for a given dataset

Sounds like a CPU chugfest to me.

> 
> Hopefully this will make it clear that, in my mind, short of manpower,
> the concepts of compression could be done in such a way that would not
> be objectionable to anyone.

Your compression ideas are already pretty objectionable to me :-)

I think you'll find my approach not only easier to implement, but far
less resource-intensive and "good enough" for most everyone's usage
scenarios (well, everyone who cares about archive/compression support,
anyway).

But since you are going to look into hacking this yourself, feel free to
go with whatever you want. Don't let me stop ya ;-)

Jeff

> 
> On Mon, 2004-05-10 at 22:17, Not Zed wrote:
> > On Mon, 2004-05-10 at 16:02 -0700, Ray Lee wrote: 
> > > On Mon, 2004-05-10 at 15:28, Jeffrey Stedfast wrote:
> > > > you are forgetting the fact that folders are generally not read-only,
> > > > and so in order to write any new data to the gzip file, you'd have to
> > > > rewrite it from scratch which negates any speed improvements you could
> > > > possibly claim.
> > > 
> > > ray:~$ echo hello | gzip >test.gz
> > > ray:~$ echo world | gzip >>test.gz
> > > ray:~$ zcat test.gz
> > > hello
> > > world
> > > ray:~$
> > > 
> > > As long as the archive folders only support appending, there's no need
> > > to rewrite the entire file. Further, there's no need to even keep it in
> > > one big file (and many good reasons not to). Partition the archives by
> > > month, or something.
> > 
> > FWIW there is actually a reason to store them in one compressed stream
> > (vs catting them or separate files).  It will compress a lot better,
> > one large stream vs many smaller ones, there is a lot more redundant
> > data to compress.  Particularly considering the typical size of email
> > messages.
> > 
> > > > also, as a curiosity, I actually tested this theory and it doesn't hold
> > > > true. reading/inflating a gzip file off disk is no faster than reading
> > > > the non-compressed file off disk, *and* inflating the gzip file pegs the
> > > > cpu so if the app was doing other things then it would negatively impact
> > > > performance of those other operations.
> > > 
> > > This rather obviously depends on CPU speed versus disk speed, yes? If I
> > > had a modern CPU with a device that had a transfer speed of 1 byte a
> > > second, compressing the stream is an obvious win. If I have a device
> > > with a transfer speed of 1 GB/s, it's an obvious loss.
> > It also depends on other factors like i/o readahead, async i/o etc.  I
> > remember doing an async i/o based GIF decoder on an Amiga 500.  It
> > could decode raw gif at about the speed it could be loaded off floppy
> > (hmm, 7mhz!), without async i/o it bit, but with async i/o it was much
> > faster than loading the raw image would have been.  Still, compression
> > is usually much more expensive.
> > 
> > 
> > Michael Zucchi
> > <notzed@ximian.com>
> > 
> > Ximian Evolution and
> > Free Software Developer
> > 
> > 
> > Novell, Inc.
> -- 
> Todd Fries .. todd@fries.net
> 
>  _____________________________________________
> |                                             \  1.636.410.0632 (voice)
> | Free Daemon Consulting, LLC                 \  1.405.227.9094 (voice)
> | http://FreeDaemonConsulting.com             \  1.866.792.3418 (FAX)
> | "..in support of free software solutions."  \  1.700.227.9094 (IAXTEL)
> |                                             \          250797 (FWD)
>  \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
>                                                  
>               37E7 D3EB 74D0 8D66 A68D  B866 0326 204E 3F42 004A
>                         http://todd.fries.net/pgp.txt
> 
> 
> 
> 
> 
> _______________________________________________
> evolution-hackers maillist  -  evolution-hackers@lists.ximian.com
> http://lists.ximian.com/mailman/listinfo/evolution-hackers
> 
-- 
Jeffrey Stedfast
Evolution Hacker - Novell, Inc.
fejj@ximian.com  - www.novell.com



