Re: GNOME Message Management System (Storage mechanism)




Hello Mr. Wimer,


On Sun, 18 Apr 1999 04:58:08 -0700 (PDT)
Scott Wimer <scottw@dev.cgibuilder.com> wrote:

First of all, I thank you for summarizing the earlier discussion into
your proposed design.  It will benefit all list members.

> This used to be the mail client thread.  But, we started discussing
> stuff well beyond just a simple mail client.
> 
> Here's my take on a decent Message storage approach.  I'm trying to
> optimize a certain number of activities here:
>     fast Message storage
>     ease of Message grouping, allow multiple grouping
>     ease of Message ordering
>     scalable Message group sizes
>     minimal space wasted
>     fast Message retrieval
>     ability to regenerate corrupted indexes (for ordering and grouping)
>     fast Message searching abilities
I like what your proposed design accomplishes.

[snip]
> I think this storage architecture should work rather well for meeting
> the above goals.  Here we go.  Some of this has shown up in earlier
> emails, this is mostly a condensing of other data and an expansion on
> a couple of points.
>     Each Message is given a unique Message ID.
Yes.  This is a must: 'a public db key', as you phrased it earlier.

>     Message Composition
>         <Message ID>
>         [group1 [,groupN]* ]*
>         Original Message Header
>         Original Message Body
>     Message ID Composition
>         (date of arrival)-(random 10 character string)
>         The date of arrival is a 32-bit integer, time since the epoch
>         The random string is open for discussion, basically, just
>             there to make the message ID's unique.
Please allow me to save these points for a separate discussion.
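
(Saving the details for later, but to make sure I read the proposal
correctly, I picture the ID generation as something like the rough C
sketch below.  The suffix alphabet and the exact "%ld-%s" layout are
my assumptions, not part of your proposal.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Rough sketch of the proposed Message ID: 32-bit seconds since
     * the epoch, a dash, then a 10-character random suffix. */
    static void make_message_id(char *buf, size_t len)
    {
        static const char alphabet[] =
            "abcdefghijklmnopqrstuvwxyz"
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        char suffix[11];
        int i;

        for (i = 0; i < 10; i++)
            suffix[i] = alphabet[rand() % (sizeof(alphabet) - 1)];
        suffix[10] = '\0';

        snprintf(buf, len, "%ld-%s", (long) time(NULL), suffix);
    }

    int main(void)
    {
        char id[64];

        srand((unsigned) time(NULL));
        make_message_id(id, sizeof(id));
        printf("%s\n", id);    /* e.g. 924436539-j3902mdnijH */
        return 0;
    }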

>     Message Storage begins at a single directory on the disk
>         Softlinks out to other directories are allowed
>         This simplifies the code for accessing the Messages themselves
I fail to see this.  The client program should not be calling file
system functions directly, because it has to support multiple file
systems over multiple platforms.  A wrapping layer is almost always a
necessity.  If you use a database, the client program can call the
database functions directly, because the differences between file
systems and platforms are already wrapped inside the database
implementation.
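
By "wrapping layer" I mean something like the following, purely as a
sketch.  None of these names exist anywhere; they are only meant to
show the shape of the interface the client would program against,
with the file-system or database details hidden behind it:

    #include <stddef.h>

    typedef struct MsgStore MsgStore;

    /* Backend operations: one implementation for a flat-file store,
     * another for a database file, and the client never knows which. */
    typedef struct {
        int   (*open)  (MsgStore *store, const char *location);
        int   (*put)   (MsgStore *store, const char *msg_id,
                        const void *data, size_t len);
        void *(*get)   (MsgStore *store, const char *msg_id,
                        size_t *len);
        int   (*close) (MsgStore *store);
    } MsgStoreOps;

    struct MsgStore {
        const MsgStoreOps *ops;    /* backend implementation */
        void              *priv;   /* backend-private state  */
    };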

>         This simplifies people moving their Message store location
I fail to see this either.  There is no difference between a single
directory and a single database device (which is usually implemented
as a disk file), as far as relocation or backup is concerned.

>     Messages are stored across several directories under the base dir
>         The target directory is chosen by a hash of the Message ID
>         Multiple levels of hashing are allowed
>             This reduces directory size
>         Each Message is written to a separate file
>             File name = Message ID
You do not have to care about any of these details with a database system.
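
(For readers following along, my reading of the quoted scheme is
roughly the sketch below; the hash function and the single directory
level are only my guesses at the intent, and extra levels would just
add more hash digits to the path.)

    #include <stdio.h>

    static unsigned int hash_id(const char *msg_id)
    {
        unsigned int h = 5381;

        while (*msg_id)
            h = h * 33 + (unsigned char) *msg_id++;
        return h;
    }

    /* e.g. "~/Message/Base/Store/3/924436539-j3902mdnijH" */
    static void message_path(char *buf, size_t len,
                             const char *base, const char *msg_id)
    {
        snprintf(buf, len, "%s/Store/%u/%s",
                 base, hash_id(msg_id) % 10, msg_id);
    }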

>     The Main Message Index can contain pointers to sub indexes
>         Allows for sub groupings
>         An Index exists for each Grouping defined
>         A Message may be Indexed under multiple Groups
This is good.
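
(Just to check my reading, I picture the grouping structure as
something like this; the names are mine and purely illustrative:)

    #include <stddef.h>

    typedef struct GroupIndex GroupIndex;

    /* One index per Group; a Message ID may be filed under any
     * number of Groups, and an index may point at sub-indexes. */
    struct GroupIndex {
        char         name[64];        /* e.g. "Group1"          */
        char       **message_ids;     /* Messages in this Group */
        size_t       n_message_ids;
        GroupIndex **sub_indexes;     /* sub-groupings          */
        size_t       n_sub_indexes;
    };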

>     A Message Index entry looks like:
>         <Message ID> <Subject> <Location>
>         The Location is the path to the Message file
>             This is relative to the base Message directory
I fail to see the merit of mixing the file system into the picture.
Let's compare the performance of two systems, one that uses the file
system and one that does not.

[The Task Scenario]
A client wants to read messages with the title "FOO".
There are five matches, and the index knows the corresponding five
Message-IDs.

[Database with File System]
The index also knows the file system paths to those five FOOs.  It
requests those files from the file system.  The file system then
returns the files to the requesting client.  The client uses whatever
method it knows to read the data into memory.

[Database Only]
The index knows the seek offsets and byte sizes of those five FOOs.
It reserves a buffer of the appropriate size to hold all five, moves
its seek pointer five times, and reads the data into the buffer.
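
In code, the database-only path is essentially the sketch below: the
index hands back an (offset, size) pair per match inside one store
file, and reading is nothing but seeks.  The structure and function
names are mine, for illustration only:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        long   offset;    /* byte offset of the message in the store */
        size_t size;      /* length of the message in bytes          */
    } StoreEntry;

    /* Read one message straight out of the store file; the caller
     * frees the returned buffer. */
    static char *read_message(FILE *store, const StoreEntry *e)
    {
        char *buf = malloc(e->size + 1);

        if (buf == NULL)
            return NULL;
        if (fseek(store, e->offset, SEEK_SET) != 0 ||
            fread(buf, 1, e->size, store) != e->size) {
            free(buf);
            return NULL;
        }
        buf[e->size] = '\0';
        return buf;
    }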

On a single-user system, there probably isn't any performance
difference between these two.  If there are 20 users making similar
requests simultaneously, I bet users begin to notice that the
Database Only system is slightly faster.  On a 1000-user system, I
doubt that the File System approach would yield the kind of
performance that I can live with.

It does not appear that performance is the issue you are after.  It
seems to me that there is another reason why we have to use the file
system, and you are trying to devise a method within that constraint.

> If a Message directory grows to have more than a thousand or so
> entries in it, accessing each individual Message will be slow, since
> most directory lookups are linear.  This system can be made self
> balancing though.  On startup, we could check to see if any directory
> had more than some arbitrary number of entries in it, say 600.  If it
> did, then we would add another 10 numbered entries to the directory,
> and then re-hash the Messages currently stored.  This shouldn't be
> as slow as it sounds, since we will probably be able to re-link the
> Message files into the new directories.  So, it's just a bunch of
> directory table updates (and we're not letting these directories get
> huge on us), without a lot of copying happening.
> 
> The end result is a storage system that looks like this:
> 
> ~/Message/Base/Group1.db
> ~/Message/Base/GroupN.db
> ~/Message/Base/Store/[0-9]
> ~/Message/Base/Store/[0-9]/924436539-j3902mdnijH
> ~/Message/Base/Store/[0-9]/924432923-02df23ds93j
> ~/Message/Base/Store/[0-9]/924433610-923oiUIHli8
I think this will scale only if the directories are spread over
multiple physical devices or computers.  Otherwise, it only makes the
performance worse.
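
(On the re-linking step quoted above: on a POSIX file system the
rebalancing can indeed avoid copying, since each Message file can be
link()ed into its new hashed directory and the old name removed.
Something like the sketch below, with illustrative names only.)

    #include <stdio.h>
    #include <unistd.h>

    /* Move a Message file to its new hashed directory without
     * copying any data: create the new directory entry, then drop
     * the old one. */
    static int relink_message(const char *old_path, const char *new_path)
    {
        if (link(old_path, new_path) != 0) {
            perror("link");
            return -1;
        }
        if (unlink(old_path) != 0) {
            perror("unlink");
            return -1;
        }
        return 0;
    }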

Overall, I like your idea.  I just don't see a reason to mix the file
system into the picture.

Zen

----------/* E-Mail : <atmczen@ibm.net > */----------


