Re: Summary store, impl n. 8



Note that when opening this highly redundant sample, you get the following
numbers:

VmPeak:     9368 kB
VmSize:     9368 kB
VmLck:         0 kB
VmHWM:      7600 kB
VmRSS:      7600 kB
VmData:     7084 kB
VmStk:        84 kB
VmExe:        16 kB
VmLib:      2152 kB
VmPTE:        16 kB

Of course VmSize is very small here because all the strings are redundant
(each unique string is stored, and therefore mapped, only once).

We can see that VmSize is practically equal to VmRSS (only a very small
difference).

The next run is an example where every string (25 bytes per string) is
unique: I just added a %d to each and every string, and I used the i
variable in the loop to give that %d a value.

VmPeak:    22076 kB
VmSize:    22076 kB
VmLck:         0 kB
VmHWM:      7604 kB
VmRSS:      7604 kB
VmData:     7084 kB
VmStk:        88 kB
VmExe:        16 kB
VmLib:      2152 kB
VmPTE:        16 kB


This is exactly the result that I want. We can clearly see that VmSize is
at 22 MB whereas VmRSS (which is what matters) is still at practically the
same size as in the sample above.

The data_0.mmap file was 13MB in size in that example.

I attached the summary.c file so that you can reproduce this test yourself
too (it has a getchar() somewhere, so that you can read the statistics
while the program waits for a character on stdin).
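
If you want the numbers without switching to another terminal, a helper
along these lines (just a sketch, assuming Linux's /proc interface; it is
not part of the attached file) can be called right before that getchar():

#include <stdio.h>
#include <string.h>

/* Print the Vm* counters of the current process (Linux /proc interface) */
static void
print_vm_status (void)
{
	char line[256];
	FILE *f = fopen ("/proc/self/status", "r");

	if (!f)
		return;
	while (fgets (line, sizeof (line), f))
		if (strncmp (line, "Vm", 2) == 0)
			fputs (line, stdout);
	fclose (f);
}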




On Fri, 2008-01-04 at 00:19 +0100, Philip Van Hoof wrote:
> On Tue, 2008-01-01 at 23:26 +0100, Philip Van Hoof wrote:
> > Interesting idea about "cc" and "to" addresses.
> 
> Experimentally implemented. It's the caller who'll create SummaryItems
> who'll have to provide sorted addresslists by modifying his ENVELOPE
> parser in such a way that convenient strings are passed.
> 
> I have added code internally to show how it would work.
> 
> The example has the to and cc in reversed order; if you build and run the
> test (just add one (any) argument to the cmdline when running) and open
> the data_0.mmap file, you'll notice that only one string is stored for
> both cc and to (the example reuses all strings 50,000 times, but that is
> of course a sample with extremely redundant data).
> 
> 
> 
> >  dvalue :)
> > pvanhoof And flag changes of items happen in a separate file (this is for email)
> > pvanhoof Nah, they are always strings. I don't think I need a generic type
> > joh desrt: Can't wait to see dconf finished :)
> > desrt pvanhoof; it's more for the convenience factor
> > desrt there's a really nice "write this to a binary file" API
> > pvanhoof But I'm sure following up on dvalue and dconf :)
> > desrt and "read this out of a binary file" API
> > * tbf (~mathias ip-80-226-0-1 vodafone-net de) has joined #gnome-hackers
> > desrt what is the file a mapping of, anyway?
> > desrt ID numbers to strings?
> > pvanhoof summary data, that's your subject, cc, to, from, uid, sequence number
> > pvanhoof And the flags of your email
> > pvanhoof or, the ENVELOPE in IMAP terminology
> > desrt so you want to store common "from" and "to" fields without duplication
> > pvanhoof aye
> > desrt cool
> > pvanhoof And most referred to grouped together
> > pvanhoof The goal is to have fewer 4k pages in real memory modules
> > desrt is the to a string or an array of strings?
> > desrt (think: multiple recipients)
> > pvanhoof And most of the less frequently used ones in virtual memory
> > pvanhoof I handle all of the summary items as just strings
> > desrt recogniser your to/cc strategy
> > pvanhoof To save pointers 
> > desrt er
> > desrt *reconsider
> > pvanhoof Yes, true. cc and to could be lists of strings to find even more duplicates
> > pvanhoof but a list would also mean, more pointers in my heap
> > desrt it's true.
> > desrt but chances are that a huge cc: list isn't going to be exactly the same every time around
> > desrt although you could improve your chances by sorting the list
> > desrt so at least permutations don't count as different entities
> > pvanhoof Agree, but chances that a user has a lot of huge cc lists are rather small (spammers don't do this a lot :p)
> > desrt :)
> > pvanhoof yep, I sort the strings on frequency used
> > desrt no no
> > pvanhoof so those unique strings will be at the tail of the file
> > desrt i mean if you have
> > desrt cc: pvanhoof gnome org, desrt desrt ca
> > tbf pvanhoof: you really don't know normal people....
> > desrt then later
> > desrt cc: desrt desrt ca, pvanhoof gnome org
> > * RealNitro has quit (Ex-Chat)
> > pvanhoof oh yes
> > desrt ^ these are the same, but would be stored separately
> > pvanhoof that would be two strings, four pointers .. 
> > desrt well
> > desrt using my method it would be
> > desrt using your method it would be 2 (long) strings, 2 pointers
> > dobey joh: store the "bookmarked" list in a different key
> > desrt if you sort the list and then use your method
> > pvanhoof right now this is two strings two pointers, but for larger samples it makes sense to store addresses separately yes
> > tbf pvanhoof: mails regarding my fencing club usually have quite large CC lists
> > desrt then it's 1 long string and 2 pointers
> > desrt you're still not listening to me :p
> > dobey joh: though i don't know how you're storing anything currently
> > pvanhoof I am I am :)
> > desrt i mean that you should take the _string_ "pvanhoof gnome org, desrt desrt ca" and convert it to the string "desrt desrt ca, pvanhoof gnome org"
> > pvanhoof ah yes
> > pvanhoof interesting
> > desrt since that way you avoid having lots of copies of what is essentially the same list
> > pvanhoof Just sort the string's substrings 
> > desrt exactly
> > pvanhoof that's an interesting idea indeed
> > desrt ok.  dinner time.  bye :)
> > lifeless fails when you have 2 and 3 element long strings with common elements :)
> > 
> > 
> > On Tue, 2008-01-01 at 23:08 +0100, Philip Van Hoof wrote:
> > > On Tue, 2008-01-01 at 20:31 +0000, Dave Cridland wrote:
> > > > On Thu Dec 27 19:53:03 2007, Philip Van Hoof wrote:
> > > 
> > > > > o. Stores flags in a separate file. Flags are the only type of data  
> > > > > that changes relatively often in the summary. Therefore a separate  
> > > > > file.
> > >  
> > > > Hmmm... FLAGS and MODSEQ. I'm not clear if you really need MODSEQ to  
> > > > be stored at all in the client, mind.
> > > 
> > > If the MODSEQ number is not often changed (if ever), it can be put in
> > > the mapped file. If it often changes, it can be put in the flags file as
> > > a new column there (like after the flags bitfield number).
> > > 
> > > Of course it is sizeof(int) more heap memory per item. But if this field
> > > is needed in your client, it can of course be added (or seek()ed for in
> > > the property's getter if it's not often needed, and can therefore be
> > > slow).
> > > 
> > > > > o. Stores "ftell()" offsets in a simplistic index file. Parsing this 
> > > > > 
> > > > Surely the offsets are fixed width, so a mmap() candidate?
> > > 
> > > With fixed widths, the strings you store are either too limited in
> > > their length (like, 50 characters) or too wide for the typical ones
> > > (like 250 chars or more).
> > > 
> > > If you store strings uniquely and have some sort of index file anyway
> > > (like this experiment) then a fixed width for the strings themselves is
> > > not really needed I think. Just ftell() each time you write a string and
> > > update the index.
> > > 
> > > > > o. Has a way to expunge an item (and the sequence numbers of higher
> > > > >    items are adapted automatically)
> > > > >  ...
> > > > > 
> > > > And unsolicited FETCH FLAGS. Possibly others, too.
> > > 
> > > Yep, unsolicited FETCH FLAGS can be handled by this experimental
> > > implementation (or will be once I have coded writing the flags file after
> > > a flags update), and so can EXPUNGE and VANISHED. An EXISTS followed by a
> > > FETCH ENVELOPE, and also an unsolicited NOTIFY's ENVELOPE, can be handled
> > > too.
> > > 
> > > Note that I want the flags file to be written less frequently than flags
> > > get set on items -> some way to delay the write of the file, because
> > > it's quite likely that subsequent flag changes take place within a short
> > > amount of time. Like a burst of unsolicited FETCH FLAGS or the user
> > > doing a "select all" and then "mark as read".
> > > 
> > > > > o. Has a way to get an item by sequence number quickly (hashtable
> > > > >    lookup)
> > >  
> > > > > 
> > > > Polymer uses two caches for this - one which is by UID, but contains  
> > > > no msgno. The other is essentially a sparse list of UIDs.
> > > 
> > > > Finding a sequence number from a UID is done by successive  
> > > > approximation - it assumes that the graph of msgno vs UID  
> > > > approximates a straight line, which astonishingly it does. It does a  
> > > > little more cleverness by quickly checking the currently loaded  
> > > > blocks, too, but that's the bulk of it.
> > > 
> > > Right, sounds sensible to optimise this indeed. Right now I just store
> > > two separate hashtables that have their values point to the same items.
> > > 
> > > It's indeed rather dumb and not very memory-efficient (a hashtable's
> > > node is quite big in size).
> > > 
> > > The reason for this simplistic hashtable method was that out-of-order
> > > inserting of data was easy to implement this way (just insert new
> > > items into the table).
> > > 
> > > 
> > > Mindnote:
> > > A drawback of a hashtable compared with a GPtrArray is the foreach speed,
> > > which will matter when copying the item pointers to the GtkTreeModel's
> > > store if that one is GPtrArray based: a foreach over a GPtrArray is of
> > > course just a for-loop, whereas a foreach over a hashtable is (like) a
> > > doubly linked list loop (it does more dereferencing, which is afaik
> > > slower than just doing the i++ math in the for-loop).
> > > 
> > > 
> > > > The effect this'd have would be to (usually) map the first and last  
> > > > blocks, and one other (the one the user was "looking" at). Quite  
> > > > often, all three are the same.
> > > 
> > > > So for tiny mailboxes - one or two blocks - it's a straightforward  
> > > > successive approximation search through the mmapped block containing  
> > > > the UID. (And that's easy to find).
> > > 
> > > > For medium mailboxes, there's the rather nice property that, in  
> > > > general, users get rid of mail fairly evenly throughout. (There's a  
> > > > gentle upward curve, actually, as spam increases).
> > > 
> > > Right.
> > > 
> > > > For massive mailboxes, these turn out to be mailing list archives,  
> > > > etc, where no expunges have happened at all, and the successive  
> > > > approximation technique finds the UID with one probe.
> > > 
> > > 
> > > > > o. Has a way to create new items easily. Deals with out-of-order  
> > > > > adding of items (when requesting an item with a sequence number that  
> > > > > doesn't yet exist, you'll simply get NULL)
> > > > > 
> > > > You probably want to be returning things other than NULL. You've got,  
> > > > at least:
> > > > 
> > > > - "This data cannot exist" - ie, the UID or msgno doesn't exist.
> > > > - "This data doesn't exist" - ie, the UID or msgno exists, but the  
> > > > message does not have this data.
> > > > - "I don't have this data" - ie, the UID or msgno exists, but the  
> > > > data hasn't been requested yet.
> > > 
> > > So some sort of status or error code that you get when requesting an
> > > item. Something like this perhaps?
> > > 
> > > typedef enum {
> > > 	notyetreceived,
> > > 	fine,
> > > 	nonexisting
> > > } SummaryItemState;
> > > 
> > > typedef struct {
> > > 	...
> > > } SummaryItem;
> > > 
> > > 
> > > Summary *s = ...
> > > SummaryItem *item = NULL;
> > > SummaryItemState state;
> > > 
> > > summary_item_by_uid (s, "0001", &item, &state);
> > > 
> > > if (!item && state == notyetreceived) {
> > > 	printf ("...\n");
> > > } else if (item && state == fine) {
> > > 	printf ("%s\n", summary_item_get_subject (item));
> > > }
> > > 
> > > > > o. Scales to > 200,000 items easily
> > > 
> > > > Cool. I'll be able to read my mail on Modest. It managed to  
> > > > synchronize my work account yesterday, after several weeks.
> > > 
> > > :-)
> > > 
> > > Modest is modestly aimed at supporting < ~30,000 items on Nokia's
> > > devices. I have a fairly good idea why more items make the application
> > > too slow (mostly a combination of GtkTreeView-style live sorting while
> > > retrieving takes place and meanwhile periodic writing of the summary's
> > > mmapped file).
> > > 
> > > Regretfully I don't have the funding to work on improving these problems,
> > > nor are really large mailboxes a very high priority for Modest atm. This
> > > experimental summary implementation is a first step towards solving one of
> > > the two problems Modest currently has with very large mailboxes.
> > > 
> > > The demoui or TMut can cope with ~ 70,000 items on a N800 device, by the
> > > way. Just don't start sorting things.
> > > 
> > > Obviously Modest must do sorting for the user. An end-user doesn't
> > > really care a lot about why a sorted summary makes his mobile E-mail
> > > client unable to cope with insanely large mailboxes. Except if he has one,
> > > like you have :D, of course.
> > > 
> > > (It should be fairly easy to play with Modest's source and shortcut the
> > > GtkTreeModelSortable stuff, directly assigning the TnyGtkHeaderListModel
> > > -- search for that type -- as the treeview's model.)
> > > 
> > > > > o. Should be relatively thread safe (at least some attempts at  
> > > > > making it
> > > > >    that way are already in place). (although I'd love to use  
> > > > > lock-free
> > > > >    hashtables, I simply used mutexes for this)
> > > > > 
> > > > > 
> > > > I wouldn't get stressed about this. Instead, I'd suggest you get rid  
> > > > of whatever inane threading is going on, and/or supply a wrapper API  
> > > > which marshals requests to the single thread handling the cache and  
> > > > protocol.
> > > 
> > > Regretfully, though, you need to access the summary from both the UI
> > > thread and from the thread that is receiving stuff.
> > > 
> > > You can't really marshal the requests coming from the UI thread unless
> > > you want to make things hideously slow for the user interface. You can,
> > > however, indeed, marshal the requests (to store new stuff) coming from
> > > the thread that is receiving new material (you queue it).
> > > 
> > > Queueing for that thread is something I think the application itself (or
> > > in my case Tinymail, for that matter) should take care of, no? That's my
> > > current idea.
> > > 
> > > -> Direct access to the summary from the UI's thread
> > > -> Queued access to the summary from the data receiving thread, perhaps
> > >    queued in the UI's thread (which wouldn't require thread safety
> > >    anymore then)
> > > 
> > > > > o. A freeze/thaw API for Dave so that he can make multiple changes.  
> > > > > This integrates with (#z) of course.
> > > > 
> > > > Yes... I find it improves performance hugely when the user highlights  
> > > > all messages and hits "delete".
> > > 
> > > That should result in saving three empty files once the freeze API is
> > > called for. Will be fast, indeed.
> > > 
> > > 
> > > -- 
> > > Philip Van Hoof, freelance software developer
> > > home: me at pvanhoof dot be 
> > > gnome: pvanhoof at gnome dot org 
> > > http://pvanhoof.be/blog
> > > http://codeminded.be
> > > 
> > > _______________________________________________
> > > tinymail-devel-list mailing list
> > > tinymail-devel-list gnome org
> > > http://mail.gnome.org/mailman/listinfo/tinymail-devel-list
> _______________________________________________
> tinymail-devel-list mailing list
> tinymail-devel-list gnome org
> http://mail.gnome.org/mailman/listinfo/tinymail-devel-list
-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be



#include <glib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "summary.h"

typedef struct _SummaryBlock SummaryBlock;

static inline void summary_block_kill (SummaryBlock *b);
static void summary_block_dump (SummaryBlock *b);

/* The pointer sizes in these comments assume a 32-bit architecture. On your
 * fancy 64-bit computer each pointer will be larger! size_t and off_t will
 * also be different! Recount to know the memory consumption on such archs. */

struct _SummaryItem {

	/* We don't use all 32 bits of the flags; perhaps we can make it 16 bits
	 * too, although right now I don't have another field to pad the data
	 * with anyway. So I just keep it at 32 bits. */
	guint32 flags; /* 4 bytes */

	/* Sizes can be quite large, keep it 32 bits */
	size_t size; /* 4 bytes */

	/* A 16 bit sequence would mean a maximum of 65536 items: not enough, so
	 * let's stick to 32 bits instead. */
	guint32 seq; /* 4 bytes */

	/* padding trick { */

		/* Blocks will have a high ref_count, but items won't, so 16 bits
		 * are fine for this field */
		gint16 ref_count;

		/* This is just a bool, a bitfield doesn't help, so I just made it 
		 * 16 bits too. Together with the ref_count, it forms one word, saving
		 * us 4 bytes per item. */
		gint16 mem;

	/* } together: 32 bits, 4 bytes */


	/* Let's now put all the pointers together, they'll each consume four 
	 * bytes (we can't do a lot of padding optimizations here anyway) */

	char *uid; /* 4 bytes */
	SummaryBlock *block; /* 4 bytes */

	union {
		struct {
			off_t from_offset;
			off_t subject_offset;
			off_t to_offset;
			off_t cc_offset;
		} offsets;
		struct {
			char *from;
			char *subject;
			char *to;
			char *cc;
		} pointers;
	} mmappable; /* 16 bytes */

}; /* The frequency of this type is significant enough to care about its size:
    * 40 bytes per item (plus heap admin, which is relatively small with gslice).
    * Note that the VmRSS will also suffer from pages in the mapping that are
    * needed! */

   /* Also important for the size are the hashtable entries in uids and seqs.
    * Each such GHashNode is at least 16 bytes. Multiplied by two hashtables
    * that's 32 extra bytes. On top of that we store the uid twice in memory:
    * once as a key and once in the item: ~12 bytes x 2 = 24 bytes. */

   /* So 40+32+24+mapping */

   /* Everything added up, I estimate that each item will consume between 100
    * and 200 bytes in VmRSS (around 4 MB for 20,000 items) (TODO: accurately
    * measure this) */

struct _SummaryBlock {
	int ref_count;
	char *data;
	FILE *data_file;
	int mmap_size;
	int block_id;
	int written;
	GHashTable *uids;
	GHashTable *seqs;
	GStaticRecMutex *lock;
}; /* The frequency of this type is not significant enough to care about its size */

struct _Summary {
	GPtrArray *blocks;
	GStaticRecMutex *lock;
}; /* The frequency of this type is one per folder: we don't care about its size */


static inline char*
summary_block_index_filename (SummaryBlock *b)
{
	return g_strdup_printf ("index_%d.idx", b->block_id);
}

static inline char*
summary_block_flags_filename (SummaryBlock *b)
{
	return g_strdup_printf ("flags_%d.idx", b->block_id);
}

static inline char*
summary_block_data_filename (SummaryBlock *b)
{
	return g_strdup_printf ("data_%d.mmap", b->block_id);
}

static inline void 
prepare_summary_block (SummaryBlock *b)
{
	b->uids = g_hash_table_new_full (g_str_hash, g_str_equal, (GDestroyNotify) g_free, NULL);
	b->seqs = g_hash_table_new_full (g_int_hash, g_int_equal, NULL, NULL);
}
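
/* Note that b->seqs (g_int_hash/g_int_equal) is keyed on &item->seq: the key
 * storage lives inside the item itself. An entry must therefore be removed
 * before its item's seq field is renumbered or the item is freed, which is
 * what summary_expunge_item_shared below relies on. */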

static inline SummaryBlock*
create_new_summary_block (int block_id)
{
	SummaryBlock *b = g_slice_new0 (SummaryBlock);
	b->lock = g_new0 (GStaticRecMutex, 1);
	g_static_rec_mutex_init (b->lock);

	g_static_rec_mutex_lock (b->lock);
	b->block_id = block_id;
	b->written = FALSE;
	prepare_summary_block (b);
	g_static_rec_mutex_unlock (b->lock);

	return b;
}


static inline void 
summary_block_add_item_int (SummaryBlock *b, SummaryItem *i)
{
	g_static_rec_mutex_lock (b->lock);
	i->block = b;
	b->ref_count++;
	g_hash_table_insert (b->uids, g_strdup (i->uid), i);
	g_hash_table_insert (b->seqs, &i->seq, i);
	g_static_rec_mutex_unlock (b->lock);
}

static inline void 
summary_block_add_item (SummaryBlock *b, SummaryItem *i)
{
	g_static_rec_mutex_lock (b->lock);

	if (i->block && i->block != b) {
		SummaryBlock *old = i->block;

		g_static_rec_mutex_lock (old->lock);
		old->written = FALSE;
		g_hash_table_remove (old->uids, i->uid);
		g_hash_table_remove (old->seqs, &i->seq);
		old->ref_count--;
		/* summary_block_kill() frees the block and its lock, so unlock first */
		if (old->ref_count <= 0) {
			g_static_rec_mutex_unlock (old->lock);
			summary_block_kill (old);
		} else
			g_static_rec_mutex_unlock (old->lock);
	}

	i->block = b;
	b->ref_count++;
	b->written = FALSE;
	/* TODO: insert extended and remove original if dup */
	g_hash_table_insert (b->uids, g_strdup (i->uid), i);
	g_hash_table_insert (b->seqs, &i->seq, i);
	g_static_rec_mutex_unlock (b->lock);
}

void
summary_add_item (Summary *s, SummaryItem *i)
{
	int block_id = 0; //i->seq % 1000;
	SummaryBlock *b;

	g_static_rec_mutex_lock (s->lock);

	if (s->blocks->len < block_id+1) {
		int t;
		for (t = s->blocks->len; t < block_id+1; t++) {
			SummaryBlock *mnew = create_new_summary_block (t);
			g_ptr_array_add (s->blocks, mnew);
		}
	}

	b = (SummaryBlock *) s->blocks->pdata[block_id];
	summary_block_add_item (b, i);

	g_static_rec_mutex_unlock (s->lock);

}

static inline SummaryItem*
summary_block_get_item_by_seq (SummaryBlock *b, int seq)
{
	return (SummaryItem *) g_hash_table_lookup (b->seqs, &seq);
}


SummaryItem*
summary_get_item_by_seq (Summary *s, int seq)
{
	int i;
	SummaryItem *item = NULL;

	g_static_rec_mutex_lock (s->lock);

	for (i = 0; i < s->blocks->len; i++) {
		SummaryBlock *b = (SummaryBlock *) s->blocks->pdata[i];
		item = summary_block_get_item_by_seq (b, seq);
		if (item) {
			item->ref_count++;
			item->block->ref_count++;
			break;
		}
	}

	g_static_rec_mutex_unlock (s->lock);

	return item;
}


static inline SummaryItem*
summary_block_get_item_by_uid (SummaryBlock *b, const char *uid)
{
	return (SummaryItem *) g_hash_table_lookup (b->uids, uid);
}

SummaryItem*
summary_get_item_by_uid (Summary *s, const char *uid)
{
	int i;
	SummaryItem *item = NULL;

	g_static_rec_mutex_lock (s->lock);

	for (i = 0; i < s->blocks->len; i++) {
		SummaryBlock *b = (SummaryBlock *) s->blocks->pdata[i];
		item = summary_block_get_item_by_uid (b, uid);
		if (item) {
			item->ref_count++;
			item->block->ref_count++;
			break;
		}
	}

	g_static_rec_mutex_unlock (s->lock);

	return item;
}

static inline void 
summary_expunge_item_shared (SummaryBlock *block, SummaryItem *i)
{
	GList *values, *iter; gint seq = i->seq;
	g_hash_table_remove (block->uids, i->uid);
	g_hash_table_remove (block->seqs, &i->seq);
	/* Every item with a higher sequence number shifts down by one */
	values = g_hash_table_get_values (block->uids);
	for (iter = values; iter; iter = g_list_next (iter)) {
		SummaryItem *item = iter->data;
		if (item->seq > seq) {
			g_hash_table_remove (block->seqs, &item->seq);
			item->seq--;
			g_hash_table_insert (block->seqs, &item->seq, item);
		}
	}
	g_list_free (values);
}

void 
summary_expunge_item_by_uid (Summary *s, const char *uid)
{
	int i;
	SummaryItem *item = NULL;

	g_static_rec_mutex_lock (s->lock);

	for (i = 0; i < s->blocks->len; i++) {
		SummaryBlock *b = (SummaryBlock *) s->blocks->pdata[i];
		item = summary_block_get_item_by_uid (b, uid);
		if (item) {
			summary_expunge_item_shared (b, item);
			break;
		}
	}
	g_static_rec_mutex_unlock (s->lock);
	return;
}

void 
summary_expunge_item_by_seq (Summary *s, int seq)
{
	int i;
	SummaryItem *item = NULL;

	g_static_rec_mutex_lock (s->lock);

	for (i = 0; i < s->blocks->len; i++) {
		SummaryBlock *b = (SummaryBlock *) s->blocks->pdata[i];
		item = summary_block_get_item_by_seq (b, seq);
		if (item) {
			summary_expunge_item_shared (b, item);
			break;
		}
	}

	g_static_rec_mutex_unlock (s->lock);

	return;
}
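
/* On-disk layout per block (as written by summary_block_dump below):
 *
 *   index_<id>.idx : one line per item:
 *                    "<uidlen> <uid> <seq> <size> <subject_off> <from_off> <to_off> <cc_off>"
 *   flags_<id>.idx : one line per item: "<seq> <flags>"
 *   data_<id>.mmap : the unique strings, NUL-separated, most frequently used
 *                    ones first; the *_off fields are byte offsets into this
 *                    file, which is mmap()ed read-only.
 *
 * The folder-level summary.idx (read by summary_open) just holds the number
 * of blocks. */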

static inline SummaryBlock *
read_summary_block (int block_id, SummaryBlock *b_in)
{
	FILE *idx, *flags;
	SummaryBlock *b;
	struct stat buf;
	char *index_filename;
	char *data_filename;
	char *flags_filename;

	if (b_in)
		b = b_in;
	else {
		b = g_slice_new0 (SummaryBlock);
		b->lock = g_new0 (GStaticRecMutex, 1);
		prepare_summary_block (b);
		g_static_rec_mutex_init (b->lock);
	}

	g_static_rec_mutex_lock (b->lock);

	b->block_id = block_id;
	b->written = TRUE;

	index_filename = summary_block_index_filename (b);

	idx = fopen (index_filename, "r");

	if (!idx) {
		b->written = FALSE;
		g_static_rec_mutex_unlock (b->lock);
		return b;
	}

	data_filename = summary_block_data_filename (b);

	if (b->data_file)
		fclose (b->data_file);
	if (b->data)
		munmap (b->data, b->mmap_size);

	b->data_file = fopen (data_filename, "r");

	if (!b->data_file) {
		b->written = FALSE;
		fclose (idx);
		g_free (index_filename);
		g_free (data_filename);
		g_static_rec_mutex_unlock (b->lock);
		return b;
	}

	flags_filename = summary_block_flags_filename (b);
	flags = fopen (flags_filename, "r");
	if (!flags) {
		b->written = FALSE;
		fclose (idx);
		fclose (b->data_file);
		b->data_file = NULL;
		g_free (index_filename);
		g_free (data_filename);
		g_free (flags_filename);
		g_static_rec_mutex_unlock (b->lock);
		return b;
	}

	g_free (index_filename);
	g_free (data_filename);
	g_free (flags_filename);

	fstat(fileno (b->data_file), &buf);
	b->mmap_size = buf.st_size;
	b->data = mmap (NULL, b->mmap_size, PROT_READ, MAP_PRIVATE, fileno (b->data_file), 0);

	while (!feof (idx)) {
		int len;
		if (fscanf (idx, "%d", &len) != EOF) 
		{
			SummaryItem *cur;
			/* +1 for the terminating NUL; allocate with g_malloc so the
			 * uid can later be released with g_free (see summary_item_unref) */
			char *uid = (char *) g_malloc0 (len + 1);
			gboolean do_add = FALSE;

			fgetc (idx);
			fread (uid, 1, len, idx);
			uid[len] = '\0';

			cur = summary_block_get_item_by_uid (b, uid);
			if (!cur) {
				do_add = TRUE;
				cur = g_slice_new0 (SummaryItem);
				cur->ref_count = 1;
			}
			if (!do_add)
				g_free (cur->uid); /* don't leak the previous uid on a reread */
			cur->uid = uid;
			cur->block = b;

			if (cur->mem == 1) {
				g_free (cur->mmappable.pointers.from);
				g_free (cur->mmappable.pointers.subject);
				g_free (cur->mmappable.pointers.to);
				g_free (cur->mmappable.pointers.cc);
			}
			cur->mem = 0;

			fscanf (idx, "%d%d%d%d%d%d",
				&cur->seq, &cur->size,
				(int *) &cur->mmappable.offsets.subject_offset, 
				(int *) &cur->mmappable.offsets.from_offset,
				(int *) &cur->mmappable.offsets.to_offset,
				(int *) &cur->mmappable.offsets.cc_offset);

			if (do_add)
				summary_block_add_item_int (b, cur);
		}
	}
	fclose (idx);

	/* Perhaps scan this into a table and loop over that table in the above loop instead */
	while (!feof (flags)) {
		int seq, flag;
		if (fscanf (flags, "%d%d", &seq, &flag) != EOF) {
			SummaryItem *c = summary_block_get_item_by_seq (b, seq);
			if (c)
				c->flags = flag;
		}
	}
	fclose(flags);

	g_static_rec_mutex_unlock (b->lock);

	return b;
}

const char*
summary_item_get_uid (SummaryItem *i)
{
	return i->uid;
}

int
summary_item_get_seq (SummaryItem *i)
{
	return (int) i->seq;
}

int
summary_item_get_flags (SummaryItem *i)
{
	return (int) i->flags;
}

void
summary_item_set_flags (SummaryItem *i, int new_flags)
{
	i->flags = new_flags;
	/* TODO: rewrite (or schedule a delayed rewrite of) the flags file; as
	 * discussed above, bursts of flag changes should be batched rather than
	 * written out one by one */
	return;
}


const char*
summary_item_get_subject (SummaryItem *i)
{
	const char* retval;

	if (i->block)
		g_static_rec_mutex_lock (i->block->lock);
	if (i->mem == 1)
		retval = i->mmappable.pointers.subject;
	else
		retval = i->block->data + i->mmappable.offsets.subject_offset;
	if (i->block)
		g_static_rec_mutex_unlock (i->block->lock);

	return retval;
}

const char*
summary_item_get_from (SummaryItem *i)
{
	const char *retval;

	if (i->block)
		g_static_rec_mutex_lock (i->block->lock);
	if (i->mem == 1)
		retval = i->mmappable.pointers.from;
	else 
		retval = i->block->data + i->mmappable.offsets.from_offset;
	if (i->block)
		g_static_rec_mutex_unlock (i->block->lock);

	return retval;
}


const char*
summary_item_get_to (SummaryItem *i)
{
	const char *retval;

	if (i->block)
		g_static_rec_mutex_lock (i->block->lock);
	if (i->mem == 1)
		retval = i->mmappable.pointers.to;
	else
		retval = i->block->data + i->mmappable.offsets.to_offset;
	if (i->block)
		g_static_rec_mutex_unlock (i->block->lock);

	return retval;
}

const char*
summary_item_get_cc (SummaryItem *i)
{
	const char *retval;

	if (i->block)
		g_static_rec_mutex_lock (i->block->lock);
	if (i->mem == 1)
		retval = i->mmappable.pointers.cc;
	else
		retval = i->block->data + i->mmappable.offsets.cc_offset;
	if (i->block)
		g_static_rec_mutex_unlock (i->block->lock);

	return retval;
}

static inline void 
summary_block_kill (SummaryBlock *b)
{
	if (!b->written)
		summary_block_dump (b);
	if (b->data)
		munmap (b->data, b->mmap_size);
	if (b->data_file)
		fclose (b->data_file);

	g_hash_table_destroy (b->seqs);
	g_hash_table_destroy (b->uids);
	b->seqs = NULL;
	b->uids = NULL;

	g_free (b->lock);
	g_slice_free (SummaryBlock, b);

}

SummaryItem* 
summary_item_ref (SummaryItem *i)
{
	i->ref_count++;

	if (i->block) {
		g_static_rec_mutex_lock (i->block->lock);
		i->block->ref_count++;
		g_static_rec_mutex_unlock (i->block->lock);
	}

	return i;
}

void
summary_item_unref (SummaryItem *i)
{
	i->ref_count--;

	if (i->block) {
		SummaryBlock *block = i->block;

		g_static_rec_mutex_lock (block->lock);
		block->ref_count--;
		/* summary_block_kill() frees the block and its lock, so unlock first */
		if (block->ref_count <= 0) {
			g_static_rec_mutex_unlock (block->lock);
			summary_block_kill (block);
			i->block = NULL;
		} else
			g_static_rec_mutex_unlock (block->lock);
	}

	if (i->ref_count <= 0) 
	{
		g_free (i->uid);
		if (i->mem == 1) {
			g_free (i->mmappable.pointers.from);
			g_free (i->mmappable.pointers.subject);
			g_free (i->mmappable.pointers.to);
			g_free (i->mmappable.pointers.cc);
			i->mem = 0;
		}
		g_slice_free (SummaryItem, i);
	}
}


typedef struct {
	char *str;
	int cnt, cpy;
	GList *affected;
} StoreString;


static gint 
cmp_storestring (gconstpointer a, gconstpointer b)
{
	StoreString *sa = (StoreString *) a;
	StoreString *sb = (StoreString *) b;
	return sb->cnt - sa->cnt;
}


static void
summary_block_dump (SummaryBlock *b)
{
	FILE *data, *idx, *flags;
	GList *orig = g_hash_table_get_values (b->seqs), *items = orig;
	GHashTable *strings = g_hash_table_new_full (g_str_hash, g_str_equal, g_free, NULL);
	GList *sstrings, *siter;

	char *index_filename;
	char *data_filename;
	char *flags_filename;
	char *index_filename_n;
	char *data_filename_n;
	char *flags_filename_n;

	g_static_rec_mutex_lock (b->lock);

	if (b->written) {
		g_static_rec_mutex_unlock (b->lock);
		return;
	}

	index_filename = summary_block_index_filename (b);
	data_filename = summary_block_data_filename (b);
	flags_filename = summary_block_flags_filename (b);

	index_filename_n = g_strdup_printf ("%s~", index_filename);
	data_filename_n = g_strdup_printf ("%s~", data_filename);
	flags_filename_n = g_strdup_printf ("%s~", flags_filename);

	idx = fopen (index_filename_n, "w");
	data = fopen (data_filename_n, "w");
	flags = fopen (flags_filename_n, "w");


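	/* Pass 1: collect every subject/from/to/cc string into the `strings'
	 * table and count how often each occurs; items that still hold in-memory
	 * copies (item->mem) hand their strings over here. The table is then
	 * sorted on that count (cmp_storestring) so that the most frequently
	 * used strings end up at the start of the data file and share pages,
	 * while the unique ones sink to the tail of the mapping. */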
	while (items) {
		SummaryItem *item = (SummaryItem *) items->data;

		StoreString *sb;
		char *str;

		/* subject */
		if (!item->mem)
			str = item->block->data + item->mmappable.offsets.subject_offset;
		else 
			str = item->mmappable.pointers.subject;

		sb =  g_hash_table_lookup (strings, str);
		if (sb) {
			sb->cnt++;
			sb->cpy = FALSE;
		} else {
			sb = g_slice_new0 (StoreString);
			sb->cnt = 1;
			sb->str = item->mem ? g_strdup (item->mmappable.pointers.subject) : str;
			sb->cpy = item->mem;
			g_hash_table_insert (strings, g_strdup (str), sb);
		}
		sb->affected = g_list_prepend (sb->affected, &item->mmappable.offsets.subject_offset);

		/* from */
		if (!item->mem)
			str = item->block->data + item->mmappable.offsets.from_offset;
		else 
			str = item->mmappable.pointers.from;

		sb = g_hash_table_lookup (strings, str);
		if (sb) {
			sb->cnt++;
			sb->cpy = FALSE;
		} else {
			sb = g_slice_new0 (StoreString);
			sb->cnt = 1;
			sb->str = item->mem ? g_strdup (item->mmappable.pointers.from) : str;
			sb->cpy = item->mem;
			g_hash_table_insert (strings, g_strdup (str), sb);
		}
		sb->affected = g_list_prepend (sb->affected, &item->mmappable.offsets.from_offset);


		/* to */
		if (!item->mem)
			str = item->block->data + item->mmappable.offsets.to_offset;
		else
			str = item->mmappable.pointers.to;

		sb = g_hash_table_lookup (strings, str);
		if (sb) {
			sb->cnt++;
			sb->cpy = FALSE;
		} else {
			sb = g_slice_new0 (StoreString);
			sb->cnt = 1;
			sb->str = item->mem ? g_strdup (item->mmappable.pointers.to) : str;
			sb->cpy = item->mem;
			g_hash_table_insert (strings, g_strdup (str), sb);
		}
		sb->affected = g_list_prepend (sb->affected, &item->mmappable.offsets.to_offset);


		/* cc */
		if (!item->mem)
			str = item->block->data + item->mmappable.offsets.cc_offset;
		else
			str = item->mmappable.pointers.cc;

		sb = g_hash_table_lookup (strings, str);
		if (sb) {
			sb->cnt++;
			sb->cpy = FALSE;
		} else {
			sb = g_slice_new0 (StoreString);
			sb->cnt = 1;
			sb->str = item->mem ? g_strdup (item->mmappable.pointers.cc) : str;
			sb->cpy = item->mem;
			g_hash_table_insert (strings, g_strdup (str), sb);
		}
		sb->affected = g_list_prepend (sb->affected, &item->mmappable.offsets.cc_offset);

		if (item->mem == 1) {
			g_free (item->mmappable.pointers.from);
			g_free (item->mmappable.pointers.subject);
			g_free (item->mmappable.pointers.to);
			g_free (item->mmappable.pointers.cc);
			item->mem = 0;
		}

		items = g_list_next (items);
	}

	sstrings = g_hash_table_get_values (strings);
	sstrings = g_list_sort (sstrings, cmp_storestring);

	/* Pass 2: write the strings out, most frequently used first, and patch
	 * every affected offset field to point at the string's new location */
	for (siter = sstrings; siter; siter = g_list_next (siter)) {
		StoreString *ss = (StoreString *) siter->data;
		int location = ftell (data);
		GList *aff;

		fputs (ss->str, data);
		if (ss->cpy) {
			g_free (ss->str);
			ss->cpy = FALSE;
		}
		fputc ('\0', data);

		for (aff = ss->affected; aff; aff = g_list_next (aff)) {
			int *offset = aff->data;
			*offset = location;
		}
		g_list_free (ss->affected);
		g_slice_free (StoreString, ss);
	}

	g_list_free (sstrings);
	g_hash_table_destroy (strings);

	fclose (data);

	items = orig;
	while (items) {
		SummaryItem *item = (SummaryItem *) items->data;

		fprintf (flags, "%d %d\n", item->seq, item->flags); 
		fprintf (idx, "%d %s %d %d %d %d %d %d\n",
				strlen (item->uid), item->uid,
				item->seq, item->size,
				(int) item->mmappable.offsets.subject_offset, 
				(int) item->mmappable.offsets.from_offset,
				(int) item->mmappable.offsets.to_offset,
				(int) item->mmappable.offsets.cc_offset);

		items = g_list_next (items);
	}


	g_list_free (orig);

	fclose (idx);
	fclose (flags);

	if (b->data)
		munmap (b->data, b->mmap_size);
	if (b->data_file)
		fclose (b->data_file);

	b->data = NULL;
	b->data_file = NULL;

	rename (index_filename_n, index_filename);
	rename (data_filename_n, data_filename);
	rename (flags_filename_n, flags_filename);

	/* Reread it (this copes with existing items too) */
	read_summary_block (b->block_id, b);

	g_free (index_filename);
	g_free (data_filename);
	g_free (flags_filename);
	g_free (index_filename_n);
	g_free (data_filename_n);
	g_free (flags_filename_n);

	g_static_rec_mutex_unlock (b->lock);

	return;
}

static void
summary_dump (Summary *summary)
{
	int i;

	g_static_rec_mutex_lock (summary->lock);

	for (i = 0; i < summary->blocks->len; i++) {
		SummaryBlock *b = (SummaryBlock *) summary->blocks->pdata[i];
		summary_block_dump (b);
	}

	g_static_rec_mutex_unlock (summary->lock);

	return;
}

static inline void
foreach_item_free (gpointer key, gpointer value, gpointer user_data)
{
	SummaryItem *i = (SummaryItem *) value;
	summary_item_unref (i);
}


static inline void
summary_block_free (SummaryBlock *b)
{
	g_hash_table_foreach (b->seqs, foreach_item_free, NULL);
}

void
summary_free (Summary *summary)
{
	int i;

	g_static_rec_mutex_lock (summary->lock);

	for (i = 0; i < summary->blocks->len; i++) {
		SummaryBlock *b = (SummaryBlock *) summary->blocks->pdata[i];

		g_static_rec_mutex_lock (b->lock);
		summary_block_free (b);
		b->ref_count--;
		if (b->ref_count <= 0)
			summary_block_kill (b);
		g_static_rec_mutex_unlock (b->lock);
	}
	g_ptr_array_free (summary->blocks, TRUE);

	g_static_rec_mutex_unlock (summary->lock);

	g_free (summary->lock);
	g_slice_free (Summary, summary);

	return;
}


Summary*
summary_open (const char *folder_path)
{
	Summary *summary = g_slice_new0 (Summary);
	int i, items = 0;
	char *filename = g_strdup_printf ("%s/summary.idx", folder_path);
	FILE *f = fopen (filename, "r");
	g_free (filename);

	summary->lock = g_new0 (GStaticRecMutex, 1);
	g_static_rec_mutex_init (summary->lock);

	/* summary.idx just holds the number of blocks in this folder */
	if (f) {
		if (fscanf (f, "%d", &items) != 1)
			items = 0;
		fclose (f);
	}

	g_static_rec_mutex_lock (summary->lock);

	summary->blocks = g_ptr_array_sized_new (items);

	for (i=0; i < items; i++) {
		SummaryBlock *b = read_summary_block (i, NULL);
		g_ptr_array_add (summary->blocks, b);
		b->ref_count++;
	}

	g_static_rec_mutex_unlock (summary->lock);

	return summary;
}

static inline int
cmpstringp(const void *p1, const void *p2)
{
        return strcmp(* (char * const *) p1, * (char * const *) p2);
}


static GPtrArray*
split_recipients (gchar *buffer)
{
        gchar *tmp, *start;
        gboolean is_quoted = FALSE;
        GPtrArray *array = g_ptr_array_new ();

        start = tmp = buffer;

        while (*tmp != '\0') {
                if (*tmp == '\"')
                        is_quoted = !is_quoted;
                if (*tmp == '\\')
                        tmp++;
                if ((!is_quoted) && ((*tmp == ',') || (*tmp == ';'))) {
                        gchar *part;
                        part = g_strndup (start, tmp - start);
                        part = g_strstrip (part);
                        g_ptr_array_add (array, part);
                        start = tmp+1;
                }

                tmp++;
        }

        if (start != tmp)
                g_ptr_array_add (array, g_strstrip (g_strdup (start)));

        return array;
}


/* In reality this will be done by the application instead (probably by the ENVELOPE parser!) */

static gchar *
sort_address_list (const gchar *unsorted)
{
	GPtrArray *a = split_recipients ((gchar*)unsorted);
	gint i; GString *result = g_string_new ("");
	gchar *item;

	g_ptr_array_sort (a, (GCompareFunc) cmpstringp);

	for (i=0; i < a->len; i++) {
		if (i > 0) {
			g_string_append_c (result, ',');
			g_string_append_c (result, ' ');
		}
		g_string_append (result, a->pdata[i]);
		g_free (a->pdata[i]);
	}

	g_ptr_array_free (a, TRUE);
	item = result->str;
	g_string_free (result, FALSE);

	return item;
}
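
/* Example: with the addresses from the IRC log quoted above, both
 *   "pvanhoof gnome org, desrt desrt ca"   and
 *   "desrt desrt ca, pvanhoof gnome org"
 * normalise to "desrt desrt ca, pvanhoof gnome org", so summary_block_dump
 * ends up storing the recipient list only once for both permutations. */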

SummaryItem*
summary_create_item (const char *uid, int seq, int flags, size_t size, const char *from, const char *subject, const char *to, const char *cc)
{
	SummaryItem *item = g_slice_new0 (SummaryItem);

	item->ref_count = 1;
	item->size = size;
	item->seq = seq;
	item->flags = flags;
	item->uid = g_strdup (uid);
	item->mmappable.pointers.from = g_strdup (from);
	item->mmappable.pointers.to = sort_address_list (to);
	item->mmappable.pointers.cc = sort_address_list (cc);
	item->mmappable.pointers.subject = g_strdup (subject);
	item->block = NULL;
	item->mem = 1; /* memory item */

	return item;
}

static void 
foreach_item_print_it (gpointer key, gpointer value, gpointer user_data)
{
	SummaryItem *mc = (SummaryItem *) value;

	printf ("UID %s is at sequence %d with flags %d\n", 
		summary_item_get_uid (mc),
		summary_item_get_seq (mc),
		summary_item_get_flags (mc));

	printf ("\tFrom=%s\n\tSubj=%s\n\tCc=%s\n\tTo=%s\n",
		summary_item_get_from (mc),
		summary_item_get_subject (mc),
		summary_item_get_cc (mc),
		summary_item_get_to (mc));
}

static void 
print_summary (Summary *summary)
{
	int i;

	for (i = 0; i < summary->blocks->len; i++) {
		SummaryBlock *b = (SummaryBlock *) summary->blocks->pdata[i];
		g_hash_table_foreach (b->seqs, foreach_item_print_it, NULL);
	}
}

int 
main (int argc, char **argv)
{
	Summary *summary = summary_open (".");
	int i, y = 0;
	GList *vals = NULL;

	/* printf ("sizeof (SummaryItem): %d\n", sizeof (SummaryItem)); */

	/* print_summary (summary); */

	for (i = 0; i < summary->blocks->len; i++) {
		gint z = 0;
		SummaryBlock *b = (SummaryBlock *) summary->blocks->pdata[i];
		GList *vl = g_hash_table_get_values (b->seqs), *l = vl;

		/* Take an extra reference on every other item so that it survives
		 * summary_free() below and can still be printed afterwards */
		while (l) {
			if (z % 2) {
				SummaryItem *it = summary_item_ref (l->data);
				vals = g_list_prepend (vals, it);
			}
			l = g_list_next (l); 
			z++;
		}
		g_list_free (vl);
	}

	if (argc > 1) {


		for (i = 0; i < summary->blocks->len; i++) {
			SummaryBlock *b = (SummaryBlock *) summary->blocks->pdata[i];
			y += g_hash_table_size (b->seqs);
		}

		for (i = y; i < y + 50000; i++) {
			char *uid = g_strdup_printf ("%08d", i);
			int seq = 10+i;
			int flags = i+1;
			size_t size = i+2;
			char *from = g_strdup_printf ("Jan%dtje <jantje gmail com>",i);
			char *subject = g_strdup_printf ("Ik ga naar %dde bakker he Pietje!",i);
			char *to = g_strdup_printf ("Gr%dietje <grietje gmail com>, Ha%dnsje <hansje gmail com>, Pietje <pi%detje gmail com>",i,i,i);
			char *cc = g_strdup_printf ("Pietje <p%dietje gmail com>, Hansje <hansj%de gmail com>, Grietje <gr%dietje gmail com>", i,i,i);

			SummaryItem *item = summary_create_item (uid, seq, flags, size, from, subject, to, cc);
			summary_add_item (summary, item);

			/* summary_create_item() copies these, so free the originals */
			g_free (uid); g_free (from); g_free (subject);
			g_free (to); g_free (cc);
		}

		summary_dump (summary);

		printf ("\n\n\n\n");

		print_summary (summary);

	}

if (FALSE) {
	printf ("\n\n---\n\n");
	SummaryItem *item1 = summary_get_item_by_seq (summary, 22);
	printf ("%s\n", summary_item_get_from (item1));
	summary_expunge_item_by_seq (summary, 22);
	item1 = summary_get_item_by_seq (summary, 22);
	if (item1) {
		printf ("%s\n", summary_item_get_from (item1));
	}
	getchar();
}

	summary_free (summary);

getchar();

	/* Print the odd ones (they survived summary_free thanks to the extra
	 * reference taken above) */
	while (vals) {
		SummaryItem *it = vals->data;

		foreach_item_print_it (&it->seq, it, NULL);

		summary_item_unref (it);
		vals = g_list_delete_link (vals, vals);
	}

	printf ("The blocks should be freed now\n");

	return 0;
}


