Re: [Evolution-hackers] compression for folders?



Ok.  So the initial feedback of `please add comments to this bug' was
not the right advice after all.  No biggie.  You've got the gist of what
is in that bug report anyway.  I'll attach it to this message in case it
is of further value.

What I'm expecting to do fits very nicely under a separate backend.  I
have repeatedly stated that I do not expect to compress an mbox; it
makes no sense, though some people don't seem to hear that.

I'll work on this from a separate-backend perspective.  Once I have
something working (or have questions), I'll check back.

On Tue, 2004-05-11 at 18:50, Not Zed wrote:
> On Tue, 2004-05-11 at 10:28 -0500, Todd T. Fries wrote: 
> > I wish everyone involved in this discussion would read my comments on
> > bug 23621.  It's obvious from some of the comments that it has not
> > been read.  I see no point in re-creating it here, as I was told that
> > is the proper medium for appending constructive comments.
> I'm not online at the moment so I can't, but in general bugs are good
> for tracking features and summarising (and for bugs); they aren't very
> convenient for general discussion, particularly in preliminary design
> phases or for nutting out details - which is precisely what this list
> is for. 
> > That said, please understand:
> > 
> > - compression could be made (in my mind) completely transparent if a
> > new folder type designed with compression in mind were brought to life
> > - compression could be optional (you don't use it until you opt to
> > convert a folder to a compressed-format folder)
> So you're saying this could apply to any arbitrary folder?  Any
> arbitrary local folder?
> 
> The main reason I'm against this is that it complicates the code
> somewhat.  Rather more than somewhat, actually.  Each backend is pretty
> specialised in what it does.  It operates on a specific storage format.
> This is by design and on purpose: it means each backend can be
> relatively simple, and only needs to abstract the API into that storage
> format.  Adding extra layers on top of some folders will complicate
> development and maintenance. 
> > - archiving is separate from compression
> Well, yes and no.  Why do you want to compress folders?  Only because
> you have a shitload of mail that you basically never delete.  That
> sounds a whole lot like an online archive to me.  You can play with
> words and semantics, but at the end of the day there's little
> difference.
> 
> Let's just drop the archive bit and call it a separate compressed
> backend then?  Archiving would mostly be a function of the frontend
> anyway.  But having a known, efficient backend storage for it would
> make it easier to write in the frontend. 
> > - yes, it's fast to append to gzip and even bzip2 data streams, and
> > yes, it takes some cpu time to recrunch them; for this reason, in my
> > proposed new folder type, I suggested grouping messages to allow a
> > fair tradeoff between too big an mbox as a single gzip stream and
> > every message compressed individually, both of which have obvious
> > objectionable qualities (time vs space, respectively)
> If you're using a different mailbox format, then you need another
> backend, end of story.
> 
> Maybe we're just talking about different things here.  Backend (where
> the actual work gets done) vs frontend functionality, like selecting
> an alternate mailbox format.
> 
> In 1.5.x we no longer have the option to modify the mailbox format,
> although at some point in the future this may be doable - for all
> local folders though, not on a per-folder basis.  But certainly, if
> you had a compression-capable backend, then you could have it
> per-folder based on its own functionality.
> 
> On the other hand, with the way backends are plugged in, it makes
> very little difference.  If you don't use the stuff under "On this
> computer", then no mail goes there (apart from the outgoing spool).
> 
> All you do is close that tree down and use your compressed-local
> backend.
> 
> What's the big deal? 
> > - one could even allow for a background thread, or a manually invoked
> > one, that recompresses things in the background for a tighter fit;
> > access time doesn't suffer, quick writes don't suffer, but
> > recompressing can reclaim more disk space, especially if one lets the
> > recompressing program attempt multiple algorithms to determine the
> > tightest packing for a given dataset
> Sure, there's a ton of things you can do.  I just don't want you doing
> it in the mbox code.  The mbox code is for writing to mboxes.  Once
> you compress it, especially if it isn't just a single stream, then
> you're no longer a Berkeley mailbox. 
> > Hopefully this makes it clear that, in my mind, short of manpower
> > constraints, compression could be implemented in a way that would
> > not be objectionable to anyone.
> The reasons I suggest doing it as a separate backend are manifold:
> - it makes little real difference to the user.  Users cope with IMAP
> pretty easily; it would show up the same way.
> - most users don't need this for normal working folders; I would argue
> nobody does in that case.
> - it doesn't belong in the others.  Especially since it would
> presumably be a different storage format entirely, and not merely
> a compression of otherwise identical objects.
> - it's a backend.  It has to go in the backends.  Backends aren't the
> frontend, and can be/are hidden from the user anyway.
> - it can be developed in parallel, independently.  No objection to it
> going into the main CVS (I would encourage it - in fact it could
> probably fit in the local provider, but it has to be a different type,
> not a layer on top), but it needn't even do that.  This also lowers the
> risk: adding major new features to an existing backend in which people
> have gigabytes of 'mission critical' email isn't low risk.
> 
> This last point is reinforced by some other facts:
> - the APIs aren't that simple, and there's a lot of stuff to learn and
> to implement
> - almost none of the existing code in a given backend will be
> re-usable as soon as you change the storage format.  That's all they're
> for, after all: abstracting the storage format.
> 
> Again, maybe we're just misunderstanding what you're talking about
> (again, I apologise for not being able to read the bug report - I was
> busy this week fixing bugs, and am offline for a few hours).
> 
> The strongest point is really the parallel development angle.  You can
> provide all the functionality without interfering with any of the
> core in any way.  I mean, you could potentially just take the mailbox
> one and develop a compressed one in parallel.  Once it is stable,
> then merge the code, or not, as appropriate.  Since it's all abstracted
> anyway, you'd have to do all of this regardless.
> 
> Michael
> 
> 
> Michael Zucchi
> <notzed ximian com>
> 
> Ximian Evolution and
> Free Software Developer
> 
> 
> Novell, Inc.
-- 
Todd Fries .. todd fries net

 _____________________________________________
|                                             \  1.636.410.0632 (voice)
| Free Daemon Consulting, LLC                 \  1.405.227.9094 (voice)
| http://FreeDaemonConsulting.com             \  1.866.792.3418 (FAX)
| "..in support of free software solutions."  \  1.700.227.9094 (IAXTEL)
|                                             \          250797 (FWD)
 \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
                                                 
              37E7 D3EB 74D0 8D66 A68D  B866 0326 204E 3F42 004A
                        http://todd.fries.net/pgp.txt




------- Additional Comments From todd fries net 2004-05-04 13:46 -------


As directed by the hackers mailing list, adding my $.02 to this. 
 
I recall one clear detail from my OS courses: reading a block of data 
off the hard drive costs the equivalent of a great many cpu cycles 
spent waiting.  I do not know today's exact ratios, but for the sake 
of argument I think it is still a safe bet that decompressing and 
compressing data is faster than loading the equivalent uncompressed 
data off of a hard drive. 
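 
A rough back-of-the-envelope with purely illustrative numbers (my 
assumptions, not measurements): 
 
   sequential disk read:  ~50 MB/s 
   gunzip throughput:     ~100 MB/s of output on a fast cpu 
   compression ratio:     ~3:1 on typical mail 
 
   300mb of mail, uncompressed:  300/50 = ~6.0s to read 
   same mail at 3:1:             100/50 + 300/100 = ~5.0s to read 
                                 and unpack 
 
So as long as decompression outruns the disk, the compressed folder 
can win on wall-clock time as well as on space. 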
 
That said, there stand to be some efficiency tradeoffs.  I've used 
mutt for years now, with the compression option.  It works, however 
it is more or less a hack.  Before opening a file ending in '.gz' or 
'.bz2', mutt decompresses it to a temporary working directory, and 
then re-compresses it when finished.  This works ok, but when you 
have a single 480mb mbox file ending in .bz2, and a decompression 
time of many minutes on a fast cpu, you begin to realize there must 
be a better way. 
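 
For reference, mutt's compressed-folder support is driven by hooks 
roughly along these lines (the stock examples from its documentation; 
%f is the compressed folder, %t the temporary uncompressed copy): 
 
   open-hook   \\.gz$ "gzip -cd %f > %t" 
   close-hook  \\.gz$ "gzip -c %t > %f" 
   append-hook \\.gz$ "gzip -c %t >> %f" 
 
Note that every open and close round-trips the entire file through 
the compressor, which is exactly the 480mb problem above. 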
 
While I now understand (after reading the archival description 
above) the concept of archival, I have yet to understand how it 
would work for me.  I want to view all emails from my wife, for 
example; so long as the vfolder searches can also search the 
`archive', I am not going to complain about whatever is implemented 
with this functionality in mind. 
 
Personally, I see two feature requests in this bug id.  One is 
archival, the other is compression.  They are related, but 
distinctly separate. 
 
I personally would love to have a per-mailbox option under 
preferences to choose to compress that mailbox, along with a choice 
of compression mechanism. 
 
I do not know if it would make sense to compress mbox mailboxes, but 
perhaps if one understands that it would simply take a long time on 
larger mailboxes, then this could be acceptable. 
 
I also do not know if it would make sense to compress Maildir 
mailboxes: the average email message is so small that the per-file 
gzip header and trailer overhead would leave only negligible space 
savings. 
 
For the same reason, I do not know if it would make sense to 
compress individual email messages, either in a modified mbox file 
format (i.e. the 'From ...' line left unmodified and the rest 
compressed, or a special attachment type that is a compressed body) 
or in some other manner. 
 
It seems to me that the best solution would be to invent a new 
mailbox format designed specifically with compression in mind.  I 
would personally suggest such a format have a few user-tweakable 
knobs, with sane defaults: 
 
   - groups of messages are compressed together; the groups are 
     indexed, similar to the zip format, as individual compression 
     blocks, for `quick' retrieval in large mailboxes (a sketch of 
     such an index record follows this list) 
   - a message group could be defined by a threshold on the number 
     of messages, or on their total size; for example: 
        every 100 messages form a group in a single 
          compression block 
        -or- 
        every time the 100k barrier is broken, a new group of 
          messages is started 
   - the compression mechanism could be specified as: 
       - bzip2 (1 .. 9) 
       - gzip  (1 .. 9) 
       - compress 
       - [any others?] 
       - autoselect 
      .. with `auto', each compression block would be tested with 
         each available compression mechanism to determine the 
         one that compresses best; while it is true that one tends 
         to be the winner in general, in some specific cases I've 
         seen others do better, and in some cases I've seen the 
         compressed output of one pass be further compressible by 
         a second pass (a selection sketch follows the question 
         below) 
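 
To make the index idea concrete, here is a minimal sketch of one 
record per compression block; the names and layout are purely my 
invention for illustration, not an existing Camel structure: 

#include <stdint.h>

/* One index record per compression block; illustrative layout only. */
struct block_entry {
	uint64_t offset;     /* byte offset of the block in the folder file */
	uint32_t comp_len;   /* compressed size on disk */
	uint32_t uncomp_len; /* size after decompression */
	uint32_t first_msg;  /* sequence number of the first message */
	uint16_t n_msgs;     /* how many messages the block holds */
	uint8_t  algo;       /* 0 = gzip, 1 = bzip2, 2 = compress */
};

/* Close the current block when either default threshold above is
 * hit: 100 messages or 100k of uncompressed data. */
static int block_is_full(uint32_t n_msgs, uint32_t uncomp_len)
{
	return n_msgs >= 100 || uncomp_len >= 100 * 1024;
}

Fetching message N is then a binary search over first_msg, one seek, 
and the decompression of a single block instead of the whole folder. 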
 
Could anyone suggest anything wrong with the above thinking? 
My intention is to catch up to the current development cvs HEAD and 
see what I can do about the new mailbox format. 
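 
For the `auto' knob, a minimal sketch of the selection step, assuming 
zlib and libbz2 (error handling and the LZW `compress' case are 
omitted; buffer sizes follow each library's documented worst case): 

#include <stdlib.h>
#include <zlib.h>
#include <bzlib.h>

/* Compress one block with each candidate and keep the smallest
 * result.  Returns a malloc'd buffer; *algo is 0 for deflate (what
 * gzip uses), 1 for bzip2. */
static unsigned char *compress_auto(const unsigned char *in, unsigned int in_len,
                                    unsigned int *out_len, int *algo)
{
	uLongf zlen = compressBound(in_len);
	unsigned int blen = in_len + in_len / 100 + 600; /* bzip2 worst case */
	unsigned char *zbuf = malloc(zlen);
	unsigned char *bbuf = malloc(blen);

	compress2(zbuf, &zlen, in, in_len, 9);                  /* deflate -9 */
	BZ2_bzBuffToBuffCompress((char *) bbuf, &blen,
	                         (char *) in, in_len, 9, 0, 0); /* bzip2 -9 */

	if (zlen <= blen) {
		free(bbuf);
		*out_len = (unsigned int) zlen;
		*algo = 0;
		return zbuf;
	}
	free(zbuf);
	*out_len = blen;
	*algo = 1;
	return bbuf;
}

Whether the cpu cost of trying every algorithm on every block is 
worth it is exactly the sort of decision the background recompression 
thread mentioned earlier could make. 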
 
P.S.  One could also, with a new mailbox format, add a layer of 
encryption as an optional second-pass `decompression' mechanism. 

