Re: Summary store, impl n. 8
- From: Philip Van Hoof <spam pvanhoof be>
- To: Dave Cridland <dave cridland net>
- Cc: Jeffrey Stedfast <fejj novell com>, tinymail-devel-list gnome org
- Subject: Re: Summary store, impl n. 8
- Date: Fri, 04 Jan 2008 00:31:57 +0100
Note that when opening this highly redundant sample, you get the
following numbers:
VmPeak: 9368 kB
VmSize: 9368 kB
VmLck: 0 kB
VmHWM: 7600 kB
VmRSS: 7600 kB
VmData: 7084 kB
VmStk: 84 kB
VmExe: 16 kB
VmLib: 2152 kB
VmPTE: 16 kB
Of course VmSize is very small because all the strings are redundant
(and each string is therefore mapped in memory only once).
We can see that VmRSS is close to VmSize (only a very small difference).
This is an example where every string (25 bytes per string) is unique: I
just added a %d to each and every string, and used the loop variable i
to give that %d a value.
VmPeak: 22076 kB
VmSize: 22076 kB
VmLck: 0 kB
VmHWM: 7604 kB
VmRSS: 7604 kB
VmData: 7084 kB
VmStk: 88 kB
VmExe: 16 kB
VmLib: 2152 kB
VmPTE: 16 kB
This is exactly the result I want. We can clearly see that VmSize is at
22 MB, whereas VmRSS (which is what matters) is still at the exact same
size as in the sample above.
The data_0.mmap file was 13MB in size in that example.
I attached the summary.c file so that you can reproduce this test
yourself too (it has a getchar() somewhere, so that you can read the
statistics while the program waits for a character on stdin).
On Fri, 2008-01-04 at 00:19 +0100, Philip Van Hoof wrote:
> On Tue, 2008-01-01 at 23:26 +0100, Philip Van Hoof wrote:
> > Interesting idea about "cc" and "to" addresses.
>
> Experimentally implemented. The caller who creates SummaryItems will
> have to provide sorted address lists, by modifying his ENVELOPE parser
> in such a way that convenient strings are passed.
>
> I have added code internally to show how it would work.
>
> The example has the to and cc in reversed order. If you build it, run
> it with one (any) extra command-line argument, and open the
> data_0.mmap file, you'll notice that only one string is stored for
> both cc and to (the example reuses all strings 50,000 times, but it is
> of course a sample with extremely redundant data).
>
>
>
> > dvalue :)
> > pvanhoof And flag changes of items happen in a separate file (this is for email)
> > pvanhoof Nah, they are always strings. I don't think I need a generic type
> > joh desrt: Can't wait to see dconf finished :)
> > desrt pvanhoof; it's more for the convenience factor
> > desrt there's a really nice "write this to a binary file" API
> > pvanhoof But I'm sure following up on dvalue and dconf :)
> > desrt and "read this out of a binary file" API
> > * tbf (~mathias ip-80-226-0-1 vodafone-net de) has joined #gnome-hackers
> > desrt what is the file a mapping of, anyway?
> > desrt ID numbers to strings?
> > pvanhoof summary data, that's your subject, cc, to, from, uid, sequence number
> > pvanhoof And the flags of your email
> > pvanhoof or, the ENVELOPE in IMAP terminology
> > desrt so you want to store common "from" and "to" fields without duplication
> > pvanhoof aye
> > desrt cool
> > pvanhoof And most referred to grouped together
> > pvanhoof The goal is to have fewer 4k pages in real memory modules
> > desrt is the to a string or an array of strings?
> > desrt (think: multiple recipients)
> > pvanhoof And most of the less frequently used ones in virtual memory
> > pvanhoof I handle all of the summary items as just strings
> > desrt recogniser your to/cc strategy
> > pvanhoof To save pointers
> > desrt er
> > desrt *reconsider
> > pvanhoof Yes, true. cc and to could be lists of strings to find even more duplicates
> > pvanhoof but a list would also mean, more pointers in my heap
> > desrt it's true.
> > desrt but chances are that a huge cc: list isn't going to be exactly the same every time around
> > desrt although you could improve your chances by sorting the list
> > desrt so at least permutations don't count as different entities
> > pvanhoof Agree, but chances that a user has a lot of huge cc lists are rather small (spammers don't do this a lot :p)
> > desrt :)
> > pvanhoof yep, I sort the strings on frequency used
> > desrt no no
> > pvanhoof so those unique strings will be at the tail of the file
> > desrt i mean if you have
> > desrt cc: pvanhoof gnome org, desrt desrt ca
> > tbf pvanhoof: you really don't know normal people....
> > desrt then later
> > desrt cc: desrt desrt ca, pvanhoof gnome org
> > * RealNitro has quit (Ex-Chat)
> > pvanhoof oh yes
> > desrt ^ these are the same, but would be stored separately
> > pvanhoof that would be two strings, four pointers ..
> > desrt well
> > desrt using my method it would be
> > desrt using your method it would be 2 (long) strings, 2 pointers
> > dobey joh: store the "bookmarked" list in a different key
> > desrt if you sort the list and then use your method
> > pvanhoof right now this is two strings two pointers, but for larger samples it makes sense to store addresses separately yes
> > tbf pvanhoof: mails regarding my fencing club usually have quite large CC lists
> > desrt then it's 1 long string and 2 pointers
> > desrt you're still not listening to me :p
> > dobey joh: though i don't know how you're storing anything currently
> > pvanhoof I am I am :)
> > desrt i mean that you should take the _string_ "pvanhoof gnome org, desrt desrt ca" and convert it to the string "desrt desrt ca, pvanhoof gnome org"
> > pvanhoof ah yes
> > pvanhoof interesting
> > desrt since that way you avoid having lots of copies of what is essentially the same list
> > pvanhoof Just sort the string's substrings
> > desrt exactly
> > pvanhoof that's an interesting idea indeed
> > desrt ok. dinner time. bye :)
> > lifeless fails when you have 2 and 3 element long strings with common elements :)
> >
> >
> > On Tue, 2008-01-01 at 23:08 +0100, Philip Van Hoof wrote:
> > > On Tue, 2008-01-01 at 20:31 +0000, Dave Cridland wrote:
> > > > On Thu Dec 27 19:53:03 2007, Philip Van Hoof wrote:
> > >
> > > > > o. Stores flags in a separate file. Flags are the only type of data
> > > > > that changes relatively often in the summary. Therefore a separate
> > > > > file.
> > >
> > > > Hmmm... FLAGS and MODSEQ. I'm not clear if you really need MODSEQ to
> > > > be stored at all in the client, mind.
> > >
> > > If the MODSEQ number is not often changed (if ever), it can be put in
> > > the mapped file. If it often changes, it can be put in the flags file as
> > > a new column there (like after the flags bitfield number).
> > >
> > > Of course it costs sizeof(int) more heap memory per item. But if this
> > > field is needed in your client, it can of course be added (or seek()ed
> > > for in the property's getter if it's not often needed, and can
> > > therefore be slow).
> > >
> > > > > o. Stores "ftell()" offsets in a simplistic index file. Parsing this
> > > > >
> > > > Surely the offsets are fixed width, so a mmap() candidate?
> > >
> > > With fixed widths, the strings you store are either too limited in
> > > their length (like, 50 characters) or too wide for the typical ones
> > > (like 250 chars or more).
> > >
> > > If you store strings uniquely and have some sort of index file anyway
> > > (like this experiment) then a fixed width for the strings themselves is
> > > not really needed I think. Just ftell() each time you write a string and
> > > update the index.
> > >
> > > > > o. Has a way to expunge an item (and the sequence numbers of higher
> > > > > items are adapted automatically)
> > > > > ...
> > > > >
> > > > And unsolicited FETCH FLAGS. Possibly others, too.
> > >
> > > Yep, unsolicited FETCH FLAGS can be coped with by this experimental
> > > implementation (or will be, once I have coded writing the flags file
> > > after a flags update), and so can EXPUNGE and VANISHED. An EXISTS
> > > followed by a FETCH ENVELOPE, and also an unsolicited NOTIFY's
> > > ENVELOPE, can be coped with too.
> > >
> > > Note that I want the flags file to be written less frequently than
> > > flags get set on items -> some way to delay the write of the file,
> > > because it's quite likely that subsequent flag changes take place
> > > within a short amount of time, like a burst of unsolicited FETCH
> > > FLAGS or the user doing a "select all" and then "mark as read".
> > >
> > > > > o. Has a way to get an item by sequence number quickly (hashtable
> > > > > lookup)
> > >
> > > > >
> > > > Polymer uses two caches for this - one which is by UID, but contains
> > > > no msgno. The other is essentially a sparse list of UIDs.
> > >
> > > > Finding a sequence number from a UID is done by successive
> > > > approximation - it assumes that the graph of msgno vs UID
> > > > approximates a straight line, which astonishingly it does. It does a
> > > > little more cleverness by quickly checking the currently loaded
> > > > blocks, too, but that's the bulk of it.
> > >
> > > Right, sounds sensible to optimise this indeed. Right now I just store
> > > two separate hashtables that have their values point to the same items.
> > >
> > > It's indeed rather dumb and not very memory-efficient (a hashtable's
> > > node is quite big in size).
> > >
> > > The reason for this simplistic hashtable method was that out-of-order
> > > insertion of data was easy to implement this way (just insert new
> > > items into the table).
> > >
> > >
> > > Mindnote:
> > > A drawback of a GPtrArray plus a hashtable is the foreach speed, which
> > > will be needed when copying the item pointers to the GtkTreeModel's
> > > store if that one is GPtrArray based: a foreach over a GPtrArray is of
> > > course just a for-loop, whereas a foreach over a hashtable is (like) a
> > > doubly linked list loop (it does more dereferencing, which is afaik
> > > slower than just doing the i++ math in the for-loop).
> > >
> > >
> > > > The effect this'd have would be to (usually) map the first and last
> > > > blocks, and one other (the one the user was "looking" at). Quite
> > > > often, all three are the same.
> > >
> > > > So for tiny mailboxes - one or two blocks - it's a straightforward
> > > > successive approximation search through the mmapped block containing
> > > > the UID. (And that's easy to find).
> > >
> > > > For medium mailboxes, there's the rather nice property that, in
> > > > general, users get rid of mail fairly evenly throughout. (There's a
> > > > gentle upward curve, actually, as spam increases).
> > >
> > > Right.
> > >
> > > > For massive mailboxes, these turn out to be mailing list archives,
> > > > etc, where no expunges have happened at all, and the successive
> > > > approximation technique finds the UID with one probe.
> > >
> > >
> > > > > o. Has a way to create new items easily. Deals with out-of-order
> > > > > adding of items (when requesting an item with a sequence number that
> > > > > doesn't yet exist, you'll simply get NULL)
> > > > >
> > > > You probably want to be returning things other than NULL. You've got,
> > > > at least:
> > > >
> > > > - "This data cannot exist" - ie, the UID or msgno doesn't exist.
> > > > - "This data doesn't exist" - ie, the UID or msgno exists, but the
> > > > message does not have this data.
> > > > - "I don't have this data" - ie, the UID or msgno exists, but the
> > > > data hasn't been requested yet.
> > >
> > > So some sort of status or error code that you get when requesting an
> > > item. Something like this perhaps?
> > >
> > > typedef enum {
> > > 	notyetreceived,
> > > 	fine,
> > > 	nonexisting
> > > } SummaryItemState;
> > >
> > > typedef struct {
> > > 	...
> > > } SummaryItem;
> > >
> > >
> > > Summary *s = ...
> > > SummaryItem *item = NULL;
> > > SummaryItemState state;
> > >
> > > summary_item_by_uid (s, "0001", &item, &state);
> > >
> > > if (!item && state == notyetreceived) {
> > > 	printf ("...\n");
> > > } else if (item && state == fine) {
> > > 	printf ("%s\n", summary_item_get_subject (item));
> > > }
> > >
> > > > > o. Scales to > 200,000 items easily
> > >
> > > > Cool. I'll be able to read my mail on Modest. It managed to
> > > > synchronize my work account yesterday, after several weeks.
> > >
> > > :-)
> > >
> > > Modest is modestly aimed at supporting < ~30,000 items on Nokia's
> > > devices. I have a fairly good idea why more items make the application
> > > too slow (mostly a combination of GtkTreeView-style live sorting while
> > > retrieval takes place, and the periodic writing of the summary's
> > > mmapped file meanwhile).
> > >
> > > Regretfully I don't have the funding to work on improving these
> > > problems, nor are really large mailboxes a very high priority for
> > > Modest atm. This experimental summary implementation is a first step
> > > towards solving one of the two problems Modest currently has with
> > > very large mailboxes.
> > >
> > > The demoui or TMut can cope with ~ 70,000 items on a N800 device, by the
> > > way. Just don't start sorting things.
> > >
> > > Obviously Modest must do sorting for the user. An end-user doesn't
> > > really care much about why a sorted summary makes his mobile E-mail
> > > client unable to cope with insanely large mailboxes. Except if he has
> > > one, like you have :D, of course.
> > >
> > > (Should be fairly easy to mock with Modest's source: shortcut the
> > > GtkTreeModelSortable stuff and just directly assign the
> > > TnyGtkHeaderListModel -- search for that type -- as the treeview's
> > > model.)
> > >
> > > > > o. Should be relatively thread safe (at least some attempts at
> > > > > making it
> > > > > that way are already in place). (although I'd love to use
> > > > > lock-free
> > > > > hashtables, I simply used mutexes for this)
> > > > >
> > > > >
> > > > I wouldn't get stressed about this. Instead, I'd suggest you get rid
> > > > of whatever inane threading is going on, and/or supply a wrapper API
> > > > which marshals requests to the single thread handling the cache and
> > > > protocol.
> > >
> > > Regretfully, though, you need to access the summary from both the UI
> > > thread and from the thread that is receiving stuff.
> > >
> > > You can't really marshal the requests coming from the UI thread unless
> > > you want to make things hideously slow for the user interface. You
> > > can, however, indeed marshal the requests (to store new stuff) coming
> > > from the thread that is receiving new material (you queue them).
> > >
> > > Queueing for that thread is something I think the application itself (or
> > > in my case Tinymail, for that matter) should take care of, no? That's my
> > > current idea.
> > >
> > > -> Direct access to the summary from the UI's thread
> > > -> Queued access to the summary from the data receiving thread, perhaps
> > > queued in the UI's thread (which wouldn't require thread safety
> > > anymore then)
> > >
> > > > > o. A freeze/thaw API for Dave so that he can make multiple changes.
> > > > > This integrates with (#z) of course.
> > > >
> > > > Yes... I find it improves performance hugely when the user highlights
> > > > all messages and hits "delete".
> > >
> > > That should result in saving three empty files once the freeze API is
> > > called. Will be fast, indeed.
> > >
> > >
> > > --
> > > Philip Van Hoof, freelance software developer
> > > home: me at pvanhoof dot be
> > > gnome: pvanhoof at gnome dot org
> > > http://pvanhoof.be/blog
> > > http://codeminded.be
> > >
> > > _______________________________________________
> > > tinymail-devel-list mailing list
> > > tinymail-devel-list gnome org
> > > http://mail.gnome.org/mailman/listinfo/tinymail-devel-list
> _______________________________________________
> tinymail-devel-list mailing list
> tinymail-devel-list gnome org
> http://mail.gnome.org/mailman/listinfo/tinymail-devel-list
--
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be
gnome: pvanhoof at gnome dot org
http://pvanhoof.be/blog
http://codeminded.be
#include <glib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "summary.h"
typedef struct _SummaryBlock SummaryBlock;
static inline void summary_block_kill (SummaryBlock *b);
static void summary_block_dump (SummaryBlock *b);
/* The size of a pointer in this documentation assumes a 32bit architecture.
 * On your fancy 64bit computer each pointer will be larger! size_t and off_t
 * will also be different! Recount to know the memory consumption on such
 * archs. */
struct _SummaryItem {
/* We don't use all 32 bits of the flags; perhaps we can make it 16 bits
 * too, although right now I don't have another field to pad the data
 * with anyway. So I just keep it at 32 bits. */
guint32 flags; /* 4 bytes */
/* Sizes can be quite large, keep it 32 bits */
size_t size; /* 4 bytes */
/* A 16 bit sequence would mean a maximum of 65536 items: not enough, so
 * let's stick to 32 bit instead. */
guint32 seq; /* 4 bytes */
/* padding trick { */
/* Blocks will have a high ref_count, but items won't, therefore 16 bits
 * are fine for this field */
gint16 ref_count;
/* This is just a bool, a bitfield doesn't help, so I just made it
* 16 bits too. Together with the ref_count, it forms one word, saving
* us 4 bytes per item. */
gint16 mem;
/* } together: 32 bits, 4 bytes */
/* Let's now put all the pointers together, they'll each consume four
* bytes (we can't do a lot of padding optimizations here anyway) */
char *uid; /* 4 bytes */
SummaryBlock *block; /* 4 bytes */
union {
struct {
off_t from_offset;
off_t subject_offset;
off_t to_offset;
off_t cc_offset;
} offsets;
struct {
char *from;
char *subject;
char *to;
char *cc;
} pointers;
} mmappable; /* 16 bytes */
}; /* Frequency of this type is significant enough to care about its size:
 * 40 bytes per item (plus heap admin, which is relatively small with
 * gslice). Note that the VmRSS will also suffer from the pages in the
 * mapping that are needed! */
/* Also important for the size are the hashtable entries in uids and seqs.
* each such GHashNode is at least 16 bytes. Multiplied by two hashtables
* that's 32 extra bytes. On top of that we store the uid twice in memory:
* once as key and once in the item. ~12 bytes x 2= 24 bytes.*/
/* So 40+32+24+mapping */
/* Everything added I estimate that each item will consume between 100 and
* 200 bytes in VmRSS (around 4 MB for 20,000 items) (TODO: accurately
* measure this) */
struct _SummaryBlock {
int ref_count;
char *data;
FILE *data_file;
int mmap_size;
int block_id;
int written;
GHashTable *uids;
GHashTable *seqs;
GStaticRecMutex *lock;
}; /* Frequency of this type is not significant enough to care about its size */
struct _Summary {
GPtrArray *blocks;
GStaticRecMutex *lock;
}; /* Frequency of this type is one per folder: don't care about its size */
static inline char*
summary_block_index_filename (SummaryBlock *b)
{
return g_strdup_printf ("index_%d.idx", b->block_id);
}
static inline char*
summary_block_flags_filename (SummaryBlock *b)
{
return g_strdup_printf ("flags_%d.idx", b->block_id);
}
static inline char*
summary_block_data_filename (SummaryBlock *b)
{
return g_strdup_printf ("data_%d.mmap", b->block_id);
}
static inline void
prepare_summary_block (SummaryBlock *b)
{
b->uids = g_hash_table_new_full (g_str_hash, g_str_equal, (GDestroyNotify) g_free, NULL);
b->seqs = g_hash_table_new_full (g_int_hash, g_int_equal, NULL, NULL);
}
static inline SummaryBlock*
create_new_summary_block (int block_id)
{
SummaryBlock *b = g_slice_new0 (SummaryBlock);
b->lock = g_new0 (GStaticRecMutex, 1);
g_static_rec_mutex_init (b->lock);
g_static_rec_mutex_lock (b->lock);
b->block_id = block_id;
b->written = FALSE;
prepare_summary_block (b);
g_static_rec_mutex_unlock (b->lock);
return b;
}
static inline void
summary_block_add_item_int (SummaryBlock *b, SummaryItem *i)
{
g_static_rec_mutex_lock (b->lock);
i->block = b;
b->ref_count++;
g_hash_table_insert (b->uids, g_strdup (i->uid), i);
g_hash_table_insert (b->seqs, &i->seq, i);
g_static_rec_mutex_unlock (b->lock);
}
static inline void
summary_block_add_item (SummaryBlock *b, SummaryItem *i)
{
g_static_rec_mutex_lock (b->lock);
if (i->block && i->block != b) {
SummaryBlock *old = i->block;
g_static_rec_mutex_lock (old->lock);
old->written = FALSE;
g_hash_table_remove (old->uids, i->uid);
g_hash_table_remove (old->seqs, &i->seq);
old->ref_count--;
if (old->ref_count <= 0) {
/* summary_block_kill() frees the lock, so release it first */
g_static_rec_mutex_unlock (old->lock);
summary_block_kill (old);
} else
g_static_rec_mutex_unlock (old->lock);
}
i->block = b;
b->ref_count++;
b->written = FALSE;
/* TODO: insert extended and remove original if dup */
g_hash_table_insert (b->uids, g_strdup (i->uid), i);
g_hash_table_insert (b->seqs, &i->seq, i);
g_static_rec_mutex_unlock (b->lock);
}
void
summary_add_item (Summary *s, SummaryItem *i)
{
int block_id = 0; //i->seq % 1000;
SummaryBlock *b;
g_static_rec_mutex_lock (s->lock);
if (s->blocks->len < block_id+1) {
int t;
for (t = s->blocks->len; t < block_id+1; t++) {
SummaryBlock *mnew = create_new_summary_block (t);
g_ptr_array_add (s->blocks, mnew);
}
}
b = (SummaryBlock *) s->blocks->pdata[block_id];
summary_block_add_item (b, i);
g_static_rec_mutex_unlock (s->lock);
}
static inline SummaryItem*
summary_block_get_item_by_seq (SummaryBlock *b, int seq)
{
return (SummaryItem *) g_hash_table_lookup (b->seqs, &seq);
}
SummaryItem*
summary_get_item_by_seq (Summary *s, int seq)
{
int i;
SummaryItem *item = NULL;
g_static_rec_mutex_lock (s->lock);
for (i = 0; i < s->blocks->len; i++) {
SummaryBlock *b = (SummaryBlock *) s->blocks->pdata[i];
item = summary_block_get_item_by_seq (b, seq);
if (item) {
item->ref_count++;
item->block->ref_count++;
break;
}
}
g_static_rec_mutex_unlock (s->lock);
return item;
}
static inline SummaryItem*
summary_block_get_item_by_uid (SummaryBlock *b, const char *uid)
{
return (SummaryItem *) g_hash_table_lookup (b->uids, uid);
}
SummaryItem*
summary_get_item_by_uid (Summary *s, const char *uid)
{
int i;
SummaryItem *item = NULL;
g_static_rec_mutex_lock (s->lock);
for (i = 0; i < s->blocks->len; i++) {
SummaryBlock *b = (SummaryBlock *) s->blocks->pdata[i];
item = summary_block_get_item_by_uid (b, uid);
if (item) {
item->ref_count++;
item->block->ref_count++;
break;
}
}
g_static_rec_mutex_unlock (s->lock);
return item;
}
static inline void
summary_expunge_item_shared (SummaryBlock *block, SummaryItem *i)
{
GList *values, *iter; gint seq = i->seq;
g_hash_table_remove (block->uids, i->uid);
g_hash_table_remove (block->seqs, &i->seq);
/* renumber: every item with a higher sequence number shifts down one */
values = g_hash_table_get_values (block->uids);
for (iter = values; iter; iter = g_list_next (iter)) {
SummaryItem *item = iter->data;
if (item->seq > seq) {
g_hash_table_remove (block->seqs, &item->seq);
item->seq--;
g_hash_table_insert (block->seqs, &item->seq, item);
}
}
g_list_free (values);
}
void
summary_expunge_item_by_uid (Summary *s, const char *uid)
{
int i;
SummaryItem *item = NULL;
g_static_rec_mutex_lock (s->lock);
for (i = 0; i < s->blocks->len; i++) {
SummaryBlock *b = (SummaryBlock *) s->blocks->pdata[i];
item = summary_block_get_item_by_uid (b, uid);
if (item) {
summary_expunge_item_shared (b, item);
break;
}
}
g_static_rec_mutex_unlock (s->lock);
return;
}
void
summary_expunge_item_by_seq (Summary *s, int seq)
{
int i;
SummaryItem *item = NULL;
g_static_rec_mutex_lock (s->lock);
for (i = 0; i < s->blocks->len; i++) {
SummaryBlock *b = (SummaryBlock *) s->blocks->pdata[i];
item = summary_block_get_item_by_seq (b, seq);
if (item) {
summary_expunge_item_shared (b, item);
break;
}
}
g_static_rec_mutex_unlock (s->lock);
return;
}
static inline SummaryBlock *
read_summary_block (int block_id, SummaryBlock *b_in)
{
FILE *idx, *flags;
SummaryBlock *b;
struct stat buf;
char *index_filename;
char *data_filename;
char *flags_filename;
if (b_in)
b = b_in;
else {
b = g_slice_new0 (SummaryBlock);
b->lock = g_new0 (GStaticRecMutex, 1);
prepare_summary_block (b);
g_static_rec_mutex_init (b->lock);
}
g_static_rec_mutex_lock (b->lock);
b->block_id = block_id;
b->written = TRUE;
index_filename = summary_block_index_filename (b);
idx = fopen (index_filename, "r");
if (!idx) {
b->written = FALSE;
g_static_rec_mutex_unlock (b->lock);
return b;
}
data_filename = summary_block_data_filename (b);
if (b->data_file)
fclose (b->data_file);
if (b->data)
munmap (b->data, b->mmap_size);
b->data_file = fopen (data_filename, "r");
if (!b->data_file) {
b->written = FALSE;
fclose (idx);
g_free (index_filename);
g_free (data_filename);
g_static_rec_mutex_unlock (b->lock);
return b;
}
flags_filename = summary_block_flags_filename (b);
flags = fopen (flags_filename, "r");
if (!flags) {
b->written = FALSE;
fclose (idx);
fclose (b->data_file);
b->data_file = NULL;
g_free (index_filename);
g_free (data_filename);
g_free (flags_filename);
g_static_rec_mutex_unlock (b->lock);
return b;
}
g_free (index_filename);
g_free (data_filename);
g_free (flags_filename);
fstat (fileno (b->data_file), &buf);
b->mmap_size = buf.st_size;
b->data = mmap (NULL, b->mmap_size, PROT_READ, MAP_PRIVATE, fileno (b->data_file), 0);
if (b->data == MAP_FAILED)
b->data = NULL; /* keeps the later `if (b->data)' checks valid */
while (!feof (idx)) {
int len;
if (fscanf (idx, "%d", &len) != 1)
break; /* EOF or a malformed index line */
{
SummaryItem *cur;
char *uid = g_malloc0 (len + 1); /* +1 for the terminating '\0' */
gboolean do_add = FALSE;
fgetc (idx); /* skip the separating space */
fread (uid, 1, len, idx);
uid[len] = '\0';
cur = summary_block_get_item_by_uid (b, uid);
if (!cur) {
do_add = TRUE;
cur = g_slice_new0 (SummaryItem);
cur->ref_count = 1;
}
g_free (cur->uid); /* NULL for a fresh item; the old uid on a re-read */
cur->uid = uid;
cur->block = b;
if (cur->mem == 1) {
g_free (cur->mmappable.pointers.from);
g_free (cur->mmappable.pointers.subject);
g_free (cur->mmappable.pointers.to);
g_free (cur->mmappable.pointers.cc);
}
cur->mem = 0;
fscanf (idx, "%d%d%d%d%d%d",
&cur->seq, &cur->size,
(int *) &cur->mmappable.offsets.subject_offset,
(int *) &cur->mmappable.offsets.from_offset,
(int *) &cur->mmappable.offsets.to_offset,
(int *) &cur->mmappable.offsets.cc_offset);
if (do_add)
summary_block_add_item_int (b, cur);
}
}
fclose (idx);
/* Perhaps scan this into a table and loop over that table in the above
 * loop instead */
while (!feof (flags)) {
int seq, flag;
if (fscanf (flags, "%d%d", &seq, &flag) == 2) {
SummaryItem *c = summary_block_get_item_by_seq (b, seq);
if (c)
c->flags = flag;
} else
break; /* EOF or a malformed line */
}
fclose(flags);
g_static_rec_mutex_unlock (b->lock);
return b;
}
const char*
summary_item_get_uid (SummaryItem *i)
{
return i->uid;
}
int
summary_item_get_seq (SummaryItem *i)
{
	return (int) i->seq;
}
int
summary_item_get_flags (SummaryItem *i)
{
	return (int) i->flags;
}
void
summary_item_set_flags (SummaryItem *i, int new_flags)
{
i->flags = new_flags;
/* todo: rewrite flags file */
return;
}
const char*
summary_item_get_subject (SummaryItem *i)
{
const char* retval;
if (i->block)
g_static_rec_mutex_lock (i->block->lock);
if (i->mem == 1)
retval = i->mmappable.pointers.subject;
else
retval = i->block->data + i->mmappable.offsets.subject_offset;
if (i->block)
g_static_rec_mutex_unlock (i->block->lock);
return retval;
}
const char*
summary_item_get_from (SummaryItem *i)
{
const char *retval;
if (i->block)
g_static_rec_mutex_lock (i->block->lock);
if (i->mem == 1)
retval = i->mmappable.pointers.from;
else
retval = i->block->data + i->mmappable.offsets.from_offset;
if (i->block)
g_static_rec_mutex_unlock (i->block->lock);
return retval;
}
const char*
summary_item_get_to (SummaryItem *i)
{
const char *retval;
if (i->block)
g_static_rec_mutex_lock (i->block->lock);
if (i->mem == 1)
retval = i->mmappable.pointers.to;
else
retval = i->block->data + i->mmappable.offsets.to_offset;
if (i->block)
g_static_rec_mutex_unlock (i->block->lock);
return retval;
}
const char*
summary_item_get_cc (SummaryItem *i)
{
const char *retval;
if (i->block)
g_static_rec_mutex_lock (i->block->lock);
if (i->mem == 1)
retval = i->mmappable.pointers.cc;
else
retval = i->block->data + i->mmappable.offsets.cc_offset;
if (i->block)
g_static_rec_mutex_unlock (i->block->lock);
return retval;
}
static inline void
summary_block_kill (SummaryBlock *b)
{
if (!b->written)
summary_block_dump (b);
if (b->data)
munmap (b->data, b->mmap_size);
if (b->data_file)
fclose (b->data_file);
g_hash_table_destroy (b->seqs);
g_hash_table_destroy (b->uids);
b->seqs = NULL;
b->uids = NULL;
g_free (b->lock);
g_slice_free (SummaryBlock, b);
}
SummaryItem*
summary_item_ref (SummaryItem *i)
{
i->ref_count++;
if (i->block) {
g_static_rec_mutex_lock (i->block->lock);
i->block->ref_count++;
g_static_rec_mutex_unlock (i->block->lock);
}
return i;
}
void
summary_item_unref (SummaryItem *i)
{
	i->ref_count--;
	if (i->block) {
		SummaryBlock *b = i->block;
		g_static_rec_mutex_lock (b->lock);
		b->ref_count--;
		if (b->ref_count <= 0) {
			/* summary_block_kill() frees the lock, so drop it first */
			g_static_rec_mutex_unlock (b->lock);
			summary_block_kill (b);
			i->block = NULL;
		} else
			g_static_rec_mutex_unlock (b->lock);
	}
	if (i->ref_count <= 0) {
		g_free (i->uid);
		if (i->mem == 1) {
			g_free (i->mmappable.pointers.from);
			g_free (i->mmappable.pointers.subject);
			g_free (i->mmappable.pointers.to);
			g_free (i->mmappable.pointers.cc);
			i->mem = 0;
		}
		g_slice_free (SummaryItem, i);
	}
}
typedef struct {
char *str;
int cnt, cpy;
GList *affected;
} StoreString;
static gint
cmp_storestring (gconstpointer a, gconstpointer b)
{
StoreString *sa = (StoreString *) a;
StoreString *sb = (StoreString *) b;
return sb->cnt - sa->cnt;
}
static void
store_string_register (GHashTable *strings, SummaryItem *item,
	char *memstr, off_t *offset)
{
	/* The item's current string lives either on the heap (item->mem)
	 * or inside the mmap()ed data at *offset */
	char *str = item->mem ? memstr : item->block->data + *offset;
	StoreString *sb = g_hash_table_lookup (strings, str);
	if (sb) {
		sb->cnt++;
		sb->cpy = FALSE;
	} else {
		sb = g_slice_new0 (StoreString);
		sb->cnt = 1;
		sb->str = item->mem ? g_strdup (memstr) : str;
		sb->cpy = item->mem;
		g_hash_table_insert (strings, g_strdup (str), sb);
	}
	/* remember which offset fields must be patched once the string's
	 * location in the new data file is known */
	sb->affected = g_list_prepend (sb->affected, offset);
}
static void
summary_block_dump (SummaryBlock *b)
{
	FILE *data, *idx, *flags;
	GList *orig, *items;
	GHashTable *strings;
	GList *sstrings;
	char *index_filename, *data_filename, *flags_filename;
	char *index_filename_n, *data_filename_n, *flags_filename_n;
	g_static_rec_mutex_lock (b->lock);
	if (b->written) {
		g_static_rec_mutex_unlock (b->lock);
		return;
	}
	orig = g_hash_table_get_values (b->seqs);
	items = orig;
	strings = g_hash_table_new_full (g_str_hash, g_str_equal, g_free, NULL);
	index_filename = summary_block_index_filename (b);
	data_filename = summary_block_data_filename (b);
	flags_filename = summary_block_flags_filename (b);
	index_filename_n = g_strdup_printf ("%s~", index_filename);
	data_filename_n = g_strdup_printf ("%s~", data_filename);
	flags_filename_n = g_strdup_printf ("%s~", flags_filename);
	idx = fopen (index_filename_n, "w");
	data = fopen (data_filename_n, "w");
	flags = fopen (flags_filename_n, "w");
	/* Deduplicate the subject, from, to and cc of every item */
	while (items) {
		SummaryItem *item = (SummaryItem *) items->data;
		store_string_register (strings, item,
			item->mem ? item->mmappable.pointers.subject : NULL,
			&item->mmappable.offsets.subject_offset);
		store_string_register (strings, item,
			item->mem ? item->mmappable.pointers.from : NULL,
			&item->mmappable.offsets.from_offset);
		store_string_register (strings, item,
			item->mem ? item->mmappable.pointers.to : NULL,
			&item->mmappable.offsets.to_offset);
		store_string_register (strings, item,
			item->mem ? item->mmappable.pointers.cc : NULL,
			&item->mmappable.offsets.cc_offset);
		if (item->mem == 1) {
			g_free (item->mmappable.pointers.from);
			g_free (item->mmappable.pointers.subject);
			g_free (item->mmappable.pointers.to);
			g_free (item->mmappable.pointers.cc);
			item->mem = 0;
		}
		items = g_list_next (items);
	}
	/* Write each distinct string exactly once, then patch every
	 * offset that referred to it */
	sstrings = g_hash_table_get_values (strings);
	sstrings = g_list_sort (sstrings, cmp_storestring);
	while (sstrings) {
		StoreString *ss = (StoreString *) sstrings->data;
		GList *aff = ss->affected;
		int location = (int) ftell (data);

		fputs (ss->str, data);
		if (ss->cpy) {
			g_free (ss->str);
			ss->cpy = FALSE;
		}
		fputc ('\0', data);
		while (aff) {
			int *offset = aff->data;
			*offset = location;
			aff = g_list_next (aff);
		}
		g_list_free (ss->affected);
		g_slice_free (StoreString, ss);
		/* Free each node as we consume it */
		sstrings = g_list_delete_link (sstrings, sstrings);
	}
	g_hash_table_destroy (strings);
	fclose (data);
	/* Second pass: write the flags and index files */
	items = orig;
	while (items) {
		SummaryItem *item = (SummaryItem *) items->data;
		fprintf (flags, "%d %d\n", item->seq, item->flags);
		fprintf (idx, "%d %s %d %d %d %d %d %d\n",
			(int) strlen (item->uid), item->uid,
			item->seq, (int) item->size,
			(int) item->mmappable.offsets.subject_offset,
			(int) item->mmappable.offsets.from_offset,
			(int) item->mmappable.offsets.to_offset,
			(int) item->mmappable.offsets.cc_offset);
		items = g_list_next (items);
	}
	g_list_free (orig);
	fclose (idx);
	fclose (flags);
	if (b->data)
		munmap (b->data, b->mmap_size);
	if (b->data_file)
		fclose (b->data_file);
	b->data = NULL;
	b->data_file = NULL;

	/* Replace the old files with the freshly written ones */
	rename (index_filename_n, index_filename);
	rename (data_filename_n, data_filename);
	rename (flags_filename_n, flags_filename);

	/* Reread it (this copes with existing items too) */
	read_summary_block (b->block_id, b);

	g_free (index_filename);
	g_free (data_filename);
	g_free (flags_filename);
	g_free (index_filename_n);
	g_free (data_filename_n);
	g_free (flags_filename_n);

	g_static_rec_mutex_unlock (b->lock);

	return;
}
static void
summary_dump (Summary *summary)
{
	int i;

	g_static_rec_mutex_lock (summary->lock);
	for (i = 0; i < summary->blocks->len; i++) {
		SummaryBlock *b = (SummaryBlock *) summary->blocks->pdata[i];
		summary_block_dump (b);
	}
	g_static_rec_mutex_unlock (summary->lock);

	return;
}
static inline void
foreach_item_free (gpointer key, gpointer value, gpointer user_data)
{
	SummaryItem *i = (SummaryItem *) value;
	summary_item_unref (i);
}

static inline void
summary_block_free (SummaryBlock *b)
{
	g_hash_table_foreach (b->seqs, foreach_item_free, NULL);
}
void
summary_free (Summary *summary)
{
	int i;

	g_static_rec_mutex_lock (summary->lock);
	for (i = 0; i < summary->blocks->len; i++) {
		SummaryBlock *b = (SummaryBlock *) summary->blocks->pdata[i];
		g_static_rec_mutex_lock (b->lock);
		summary_block_free (b);
		b->ref_count--;
		if (b->ref_count <= 0)
			summary_block_kill (b);
		g_static_rec_mutex_unlock (b->lock);
	}
	g_ptr_array_free (summary->blocks, TRUE);
	g_static_rec_mutex_unlock (summary->lock);
	g_free (summary->lock);
	g_slice_free (Summary, summary);

	return;
}
Summary*
summary_open (const char *folder_path)
{
	Summary *summary = g_slice_new0 (Summary);
	int i, items = 0;
	char *filename = g_strdup_printf ("%s/summary.idx", folder_path);
	FILE *f = fopen (filename, "r");

	g_free (filename);
	summary->lock = g_new0 (GStaticRecMutex, 1);
	g_static_rec_mutex_init (summary->lock);

	/* The index file might not exist yet on a first run */
	if (f) {
		if (fscanf (f, "%d", &items) != 1)
			items = 0;
		fclose (f);
	}

	g_static_rec_mutex_lock (summary->lock);
	summary->blocks = g_ptr_array_sized_new (items);
	for (i = 0; i < items; i++) {
		SummaryBlock *b = read_summary_block (i, NULL);
		g_ptr_array_add (summary->blocks, b);
		b->ref_count++;
	}
	g_static_rec_mutex_unlock (summary->lock);

	return summary;
}
static inline int
cmpstringp (const void *p1, const void *p2)
{
	return strcmp (* (char * const *) p1, * (char * const *) p2);
}
static GPtrArray*
split_recipients (gchar *buffer)
{
	gchar *tmp, *start;
	gboolean is_quoted = FALSE;
	GPtrArray *array = g_ptr_array_new ();

	start = tmp = buffer;
	while (*tmp != '\0') {
		if (*tmp == '\"')
			is_quoted = !is_quoted;
		/* Skip an escaped character entirely, so that an escaped
		 * separator is not split on and we never step past the
		 * terminating NUL */
		if (*tmp == '\\' && tmp[1] != '\0') {
			tmp += 2;
			continue;
		}
		if ((!is_quoted) && ((*tmp == ',') || (*tmp == ';'))) {
			gchar *part = g_strndup (start, tmp - start);
			g_ptr_array_add (array, g_strstrip (part));
			start = tmp + 1;
		}
		tmp++;
	}
	if (start != tmp)
		g_ptr_array_add (array, g_strstrip (g_strdup (start)));

	return array;
}
/* In reality this will be done by the application instead (probably by the ENVELOPE parser!) */
static gchar *
sort_address_list (const gchar *unsorted)
{
	GPtrArray *a = split_recipients ((gchar *) unsorted);
	GString *result = g_string_new ("");
	gchar *item;
	gint i;

	g_ptr_array_sort (a, (GCompareFunc) cmpstringp);
	for (i = 0; i < a->len; i++) {
		if (i > 0)
			g_string_append (result, ", ");
		g_string_append (result, a->pdata[i]);
		g_free (a->pdata[i]);
	}
	g_ptr_array_free (a, TRUE);

	item = result->str;
	g_string_free (result, FALSE);

	return item;
}
SummaryItem*
summary_create_item (const char *uid, int seq, int flags, size_t size,
		     const char *from, const char *subject,
		     const char *to, const char *cc)
{
	SummaryItem *item = g_slice_new0 (SummaryItem);

	item->ref_count = 1;
	item->size = size;
	item->seq = seq;
	item->flags = flags;
	item->uid = g_strdup (uid);
	item->mmappable.pointers.from = g_strdup (from);
	item->mmappable.pointers.to = sort_address_list (to);
	item->mmappable.pointers.cc = sort_address_list (cc);
	item->mmappable.pointers.subject = g_strdup (subject);
	item->block = NULL;
	item->mem = 1; /* memory item */

	return item;
}
static void
foreach_item_print_it (gpointer key, gpointer value, gpointer user_data)
{
	SummaryItem *mc = (SummaryItem *) value;

	printf ("UID %s is at sequence %d with flags %d\n",
		summary_item_get_uid (mc),
		summary_item_get_seq (mc),
		summary_item_get_flags (mc));
	printf ("\tFrom=%s\n\tSubj=%s\n\tCc=%s\n\tTo=%s\n",
		summary_item_get_from (mc),
		summary_item_get_subject (mc),
		summary_item_get_cc (mc),
		summary_item_get_to (mc));
}

static void
print_summary (Summary *summary)
{
	int i;

	for (i = 0; i < summary->blocks->len; i++) {
		SummaryBlock *b = (SummaryBlock *) summary->blocks->pdata[i];
		g_hash_table_foreach (b->seqs, foreach_item_print_it, NULL);
	}
}
int
main (int argc, char **argv)
{
	Summary *summary = summary_open (".");
	int i, y = 0;
	GList *vals = NULL;

	/* printf ("sizeof (SummaryItem): %d\n", sizeof (SummaryItem)); */
	/* print_summary (summary); */

	/* Take a reference on every second item */
	for (i = 0; i < summary->blocks->len; i++) {
		gint z = 0;
		SummaryBlock *b = (SummaryBlock *) summary->blocks->pdata[i];
		GList *l = g_hash_table_get_values (b->seqs);
		while (l) {
			if (z % 2) {
				SummaryItem *it = summary_item_ref (l->data);
				vals = g_list_prepend (vals, it);
			}
			z++;
			/* Free each node as we consume it */
			l = g_list_delete_link (l, l);
		}
	}
	if (argc > 1) {
		for (i = 0; i < summary->blocks->len; i++) {
			SummaryBlock *b = (SummaryBlock *) summary->blocks->pdata[i];
			y += g_hash_table_size (b->seqs);
		}
		for (i = y; i < y + 50000; i++) {
			char *uid = g_strdup_printf ("%08d", i);
			int seq = 10 + i;
			int flags = i + 1;
			size_t size = i + 2;
			char *from = g_strdup_printf ("Jan%dtje <jantje gmail com>", i);
			char *subject = g_strdup_printf ("Ik ga naar %dde bakker he Pietje!", i);
			char *to = g_strdup_printf ("Gr%dietje <grietje gmail com>, Ha%dnsje <hansje gmail com>, Pietje <pi%detje gmail com>", i, i, i);
			char *cc = g_strdup_printf ("Pietje <p%dietje gmail com>, Hansje <hansj%de gmail com>, Grietje <gr%dietje gmail com>", i, i, i);
			SummaryItem *item = summary_create_item (uid, seq, flags, size, from, subject, to, cc);

			summary_add_item (summary, item);

			/* summary_create_item took copies */
			g_free (uid); g_free (from); g_free (subject);
			g_free (to); g_free (cc);
		}
		summary_dump (summary);
		printf ("\n\n\n\n");
		print_summary (summary);
	}
	if (FALSE) {
		SummaryItem *item1;

		printf ("\n\n---\n\n");
		item1 = summary_get_item_by_seq (summary, 22);
		printf ("%s\n", summary_item_get_from (item1));
		summary_expunge_item_by_seq (summary, 22);
		item1 = summary_get_item_by_seq (summary, 22);
		if (item1)
			printf ("%s\n", summary_item_get_from (item1));
		getchar ();
	}
	summary_free (summary);
	getchar ();

	/* Print the odd ones */
	while (vals) {
		SummaryItem *it = vals->data;
		foreach_item_print_it (&it->seq, it, NULL);
		summary_item_unref (it);
		/* Free each node as we consume it */
		vals = g_list_delete_link (vals, vals);
	}

	printf ("The blocks should be freed now\n");

	return 0;
}
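For reference, building the attached summary.c should need nothing beyond the GLib 2.x development headers (the exact pkg-config invocation is an assumption; the output name is arbitrary). As noted above, pass any argument on the first run to populate the summary files, then run it again without one:

```shell
gcc -Wall -o summary summary.c $(pkg-config --cflags --libs glib-2.0)
./summary populate
./summary
```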