[Evolution-hackers] mmap() for the summary file



Hi there,

I've been trying to replace the fread()/fopen() implementation in
camel-folder-summary.c with an mmap() one.

I know camel-file-utils.c will put duplicate strings in a hashtable and
that way reduce memory usage for the summary information. Because a lot
mail boxes have duplicate strings for the From and To headers. I know
why and how this is implemented. And I understand that this already
reduces memory usage a lot.

However. On a small device with few memory resources, the kernel knows
better when to allocate and when to deallocate uncachable data like this
summary information.

Therefore I propose to replace the implementation with mmap(). Not only
I propose it, I also already tried it myself.

While trying this, I came to the conclusion that it *would* be possible
if the strings would have been terminated by '\0' in stead of being
stored pascal-like using an encoded unsigned 32 bit integer in front of
the string data.

That decision makes this (using the current file format) impossible,
unless the mmap'd memory (and therefore also the file on disk) is
constantly rewritten (with '\0' characters) or unless the entire
infrastructure that uses the summary strings is adapted to use this
length information rather than using the strings directly from the
mmap'ed memory *as* NULL terminated C strings (char arrays with a NULL
termination). The second solution implies that all would have to be
converted to GString's.

I think it would reduce memory usage of Evolution with ~40mb (depending
on the total amount of summary information being loaded). It would make
the sorting of the header summary view a little bit slower on certain
machines (mainly on machines that have very few memory resources left,
so that the kernel will not put a lot of this mmap'ed data in its
buffers/cache).

The file format should be adapted in two ways:

- Duplicate strings will need to be stored at only one location *on the
disk*. So the hashtable implementation wouldn't be a memory-only but
also a in-the-summary-file something.

For example: A string-field can be a pointer to the first character of
the string, or a pointer to another location in the file (in the mmap).

- Strings will need to be '\0' terminated *in the file* so that they are
directly usable from the mmap() memory block. 


Who are the brave souls that want to join me with this brain-damaging
idea? And would a change like this (which would mean that a migration
procedure each time an old folder-summary is loaded would need to run)
ever get upstream?

I measured (using valgrind) that most of the Evolution memory usages
goes to storing a in-memory version of the summary files. I also
measured that there's quite a lot memory segmentation going on (while
loading the summary file) and that it (the memory for the file) consumes
~ twice as much as the on-filesystem filesize of the summary file.

Loading using mmap() would be faster and wouldn't consume as much real
memory (it would consume a mmap, that is true, and that memory would
most likely go to the buffers/cache which the kernel manages, that is
also true). Sorting might become a little bit slower (but probably not
noticeable on most desktop hardware).

I'm being serious, I would like to waste my time on this one. If the
camel team of Evolution likes the idea (and wouldn't mind wasting some
of their time on it as well). If not ... I'd rather wait for the
disk-summary branch or for libspruce than to waste my time with it.
Because forking camel would mean wasting huge amounts of time on
maintaining a fork.

I attached a patch with my current tryout. I already load the header of
the summary file using mmap. That is already working. The difficult part
is, however, making the strings themselves usable. Because those aren't
NULL terminated. But please check the patch, you'll immediately see what
I mean.

Copying the strings, and NULL terminating the copy, is not a good option
because that would make the entire mmap-concept pointless (you still
copy it to real memory, so the entire reason-for-mmap then gone). Note
that this is what the current implementation also does: it copies the
string and null terminates the copy. And then frees the malloc that was
allocated for reading it from the file.

In fact is that copy unnecessary. Since fread() is a copy (and not like
mmap also real on-disk data), it wouldn't matter if you'd use the
original malloc()'d memory. This memory copying is probably causing the
memory segmentation I mentioned above. If you'd implement it like this,
you'd better at least used gslice.

But anyway ;)


-- 
Philip Van Hoof, software developer at x-tend 
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
work: vanhoof at x-tend dot be 
http://www.pvanhoof.be - http://www.x-tend.be
? camel-mime-tables.c
? finalise
Index: camel-folder-summary.c
===================================================================
RCS file: /cvs/gnome/evolution-data-server/camel/camel-folder-summary.c,v
retrieving revision 1.148
diff -u -r1.148 camel-folder-summary.c
--- camel-folder-summary.c	12 Apr 2006 19:14:13 -0000	1.148
+++ camel-folder-summary.c	11 Jun 2006 13:48:11 -0000
@@ -33,6 +33,10 @@
 #include <errno.h>
 #include <ctype.h>
 
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/mman.h>
+
 #include "libedataserver/e-iconv.h"
 
 #include "camel-folder.h"
@@ -67,7 +71,7 @@
 #define USE_BSEARCH
 
 #define d(x)
-#define io(x)			/* io debug */
+#define io(x)	x			/* io debug */
 #define w(x)
 
 #if 0
@@ -123,6 +127,7 @@
 
 	p->filter_charset = g_hash_table_new (camel_strcase_hash, camel_strcase_equal);
 
+	s->summary_mapping = NULL;
 	s->message_info_size = sizeof(CamelMessageInfoBase);
 	s->content_info_size = sizeof(CamelMessageContentInfo);
 
@@ -540,31 +545,41 @@
 int
 camel_folder_summary_load(CamelFolderSummary *s)
 {
-	FILE *in;
-	int i;
+	int i, fd;
 	CamelMessageInfo *mi;
 
 	if (s->summary_path == NULL)
 		return 0;
+	struct stat summary_stat;
 
-	in = g_fopen(s->summary_path, "rb");
-	if (in == NULL)
-		return -1;
+	stat(s->summary_path, &summary_stat);
+	s->summary_filedes = g_fopen(s->summary_path, "rb");
+	s->summary_mapping = mmap ((caddr_t)0, summary_stat.st_size,
+			PROT_READ | PROT_WRITE, MAP_SHARED, fileno (s->summary_filedes), 0);
+
+printf ("MMAP %s 0x%02x\n", s->summary_mapping, s->summary_mapping);
+//	if (in == NULL)
+//		return -1;
 
 	CAMEL_SUMMARY_LOCK(s, io_lock);
-	if ( ((CamelFolderSummaryClass *)(CAMEL_OBJECT_GET_CLASS(s)))->summary_header_load(s, in) == -1)
-		goto error;
+	if ( ((CamelFolderSummaryClass *)(CAMEL_OBJECT_GET_CLASS(s)))->summary_header_load(s, s->summary_filedes) == -1)
+		//goto error;
+		printf ("DONE\n");
 
+	printf ("saved count (%d)\n", s->saved_count);
+		
 	/* now read in each message ... */
 	for (i=0;i<s->saved_count;i++) {
-		mi = ((CamelFolderSummaryClass *)(CAMEL_OBJECT_GET_CLASS(s)))->message_info_load(s, in);
+		
+		printf ("DOING %d\n", i);
+		mi = ((CamelFolderSummaryClass *)(CAMEL_OBJECT_GET_CLASS(s)))->message_info_load(s, s->summary_filedes);
 
 		if (mi == NULL)
 			goto error;
 
 		/* FIXME: this should be done differently, how i don't know */
 		if (s->build_content) {
-			((CamelMessageInfoBase *)mi)->content = perform_content_info_load(s, in);
+			((CamelMessageInfoBase *)mi)->content = perform_content_info_load(s, s->summary_filedes);
 			if (((CamelMessageInfoBase *)mi)->content == NULL) {
 				camel_message_info_free(mi);
 				goto error;
@@ -576,19 +591,21 @@
 
 	CAMEL_SUMMARY_UNLOCK(s, io_lock);
 	
-	if (fclose (in) != 0)
-		return -1;
+	//if (fclose (in) != 0)
+	//	return -1;
 
 	s->flags &= ~CAMEL_SUMMARY_DIRTY;
 
 	return 0;
 
 error:
+	printf ("IN ERROR\n");
 	if (errno != EINVAL)
 		g_warning ("Cannot load summary file: `%s': %s", s->summary_path, g_strerror (errno));
 	
 	CAMEL_SUMMARY_UNLOCK(s, io_lock);
-	fclose (in);
+	//munmap(s->summary_mapping, s->summary_stat.st_size);
+	//fclose (in);
 	s->flags |= ~CAMEL_SUMMARY_DIRTY;
 
 	return -1;
@@ -1392,41 +1409,48 @@
 static int
 summary_header_load(CamelFolderSummary *s, FILE *in)
 {
-	fseek(in, 0, SEEK_SET);
-
 	io(printf("Loading header\n"));
 
-	if (camel_file_util_decode_fixed_int32(in, &s->version) == -1)
-		return -1;
+	gint32 *ptr32 = s->summary_mapping;
+	time_t *ptrt;
+	
+	s->version = g_ntohl (*ptr32); ptr32++; 
 
 	/* Legacy version check, before version 12 we have no upgrade knowledge */
 	if ((s->version > 0xff) && (s->version & 0xff) < 12) {
-		io(printf ("Summary header version mismatch"));
+		io(printf ("Summary header version mismatch %d", (s->version&0xff)));
 		errno = EINVAL;
 		return -1;
 	}
 
 	if (!(s->version < 0x100 && s->version >= 13))
-		io(printf("Loading legacy summary\n"));
+		io(printf("Loading legacy summary %d %d\n", s->version, (s->version&0xff)));
 	else
-		io(printf("loading new-format summary\n"));
+		io(printf("loading new-format summary %d\n", s->version));
 
-	/* legacy version */
-	if (camel_file_util_decode_fixed_int32(in, &s->flags) == -1
-	    || camel_file_util_decode_fixed_int32(in, &s->nextuid) == -1
-	    || camel_file_util_decode_time_t(in, &s->time) == -1
-	    || camel_file_util_decode_fixed_int32(in, &s->saved_count) == -1) {
-		return -1;
-	}
+
+	s->flags = g_ntohl (*ptr32); ptr32++; 
+	s->nextuid = g_ntohl (*ptr32); ptr32++; 
+	ptrt = (time_t*)ptr32;
+	s->time = *ptrt; ptrt++; 
+	ptr32 = (gint32*)ptrt;
+	s->saved_count = g_ntohl (*ptr32); ptr32++; 
 
 	/* version 13 */
-	if (s->version < 0x100 && s->version >= 13
-	    && (camel_file_util_decode_fixed_int32(in, &s->unread_count) == -1
-		|| camel_file_util_decode_fixed_int32(in, &s->deleted_count) == -1
-		|| camel_file_util_decode_fixed_int32(in, &s->junk_count) == -1)) {
-		return -1;
+	if (s->version < 0x100 && s->version >= 13)
+	{
+		printf ("v13\n");
+		s->unread_count = g_ntohl (*ptr32); ptr32++; 
+		s->deleted_count = g_ntohl (*ptr32); ptr32++; 
+		s->junk_count = g_ntohl (*ptr32); ptr32++; 
 	}
 
+	io(printf ("CAMEL: ur:%d, dc:%d, jc:%d, v:%d\n",
+		s->unread_count, s->deleted_count, s->junk_count, s->version));
+	
+	ptr32++; ptr32++;
+	s->msginfostart = ptr32;
+	
 	return 0;
 }
 
@@ -1687,41 +1711,79 @@
 	return (CamelMessageInfo *)mi;
 }
 
+static unsigned char*
+decode_uint32 (unsigned char *start, guint32 *dest)
+{
+ 	guint32 value = 0;
+	unsigned int uv = *start;
+	int v;
+
+        /* until we get the last byte, keep decoding 7 bits at a time */
+        while ( ((v = uv) & 0x80) == 0 ) {
+                value |= v;
+                value <<= 7;
+				start++; uv=*start;
+        }
+		*dest = value | (v & 0x7f);
+
+	return ++start;
+}
+
+
+
 static CamelMessageInfo *
 message_info_load(CamelFolderSummary *s, FILE *in)
 {
 	CamelMessageInfoBase *mi;
-	guint count;
+	guint count, len, nlen;
+	unsigned char *ptrchr = s->msginfostart;
+	time_t *ptrt;	
 	int i;
-	char *subject, *from, *to, *cc, *mlist, *uid;
 
 	mi = (CamelMessageInfoBase *)camel_message_info_new(s);
 
 	io(printf("Loading message info\n"));
 
-	camel_file_util_decode_string(in, &uid);
-	camel_file_util_decode_uint32(in, &mi->flags);
-	camel_file_util_decode_uint32(in, &mi->size);
-	camel_file_util_decode_time_t(in, &mi->date_sent);
-	camel_file_util_decode_time_t(in, &mi->date_received);
-	camel_file_util_decode_string(in, &subject);
-	camel_file_util_decode_string(in, &from);
-	camel_file_util_decode_string(in, &to);
-	camel_file_util_decode_string(in, &cc);
-	camel_file_util_decode_string(in, &mlist);
+	/* uid */
+	
+	ptrchr = decode_uint32 ((unsigned char*)ptrchr, &len);
+	io(printf ("uid len = %d, str=%s\n", len, ptrchr));
+	mi->uid = ptrchr;
+	ptrchr += len;
+	
+	ptrchr = decode_uint32 ((unsigned char*)ptrchr, &len);
+	mi->flags = ptrchr;
 
-	mi->uid = uid;
-	mi->subject = camel_pstring_strdup(subject);
-	mi->from = camel_pstring_strdup(from);
-	mi->to = camel_pstring_strdup(to);
-	mi->cc = camel_pstring_strdup(cc);
-	mi->mlist = camel_pstring_strdup(mlist);
+	ptrchr = decode_uint32 ((unsigned char*)ptrchr, &len);
+	mi->size = ptrchr;
+	
+	ptrt = ptrchr;
+	mi->date_sent = ptrt; ptrt++;
+	mi->date_received = ptrt; ptrt++;
 
-	g_free(subject);
-	g_free(from);
-	g_free(to);
-	g_free(cc);
-	g_free(mlist);
+	ptrchr = ptrt;
+	
+	ptrchr = decode_uint32 ((unsigned char*)ptrchr, &len);
+	mi->subject = ptrchr;
+	ptrchr += len;
+
+	ptrchr = decode_uint32 ((unsigned char*)ptrchr, &len);
+	mi->from = ptrchr;
+	ptrchr += len;
+
+	ptrchr = decode_uint32 ((unsigned char*)ptrchr, &len);
+	mi->to = ptrchr;
+	ptrchr += len;
+
+	ptrchr = decode_uint32 ((unsigned char*)ptrchr, &len);
+	mi->cc = ptrchr;
+	ptrchr += len;
+
+	ptrchr = decode_uint32 ((unsigned char*)ptrchr, &len);
+	mi->mlist = ptrchr;
+	ptrchr += len;
+
+/*
 
 	mi->content = NULL;
 
@@ -1765,6 +1827,7 @@
 	}
 
 	if (!ferror(in))
+	*/
 		return (CamelMessageInfo *)mi;
 
 error:
Index: camel-folder-summary.h
===================================================================
RCS file: /cvs/gnome/evolution-data-server/camel/camel-folder-summary.h,v
retrieving revision 1.86
diff -u -r1.86 camel-folder-summary.h
--- camel-folder-summary.h	31 Aug 2005 04:21:56 -0000	1.86
+++ camel-folder-summary.h	11 Jun 2006 13:48:11 -0000
@@ -224,6 +224,9 @@
 	GHashTable *messages_uid; /* CamelMessageInfo's by uid */
 
 	struct _CamelFolder *folder; /* parent folder, for events */
+
+	void *summary_mapping, *msginfostart;
+	FILE *summary_filedes;
 };
 
 struct _CamelFolderSummaryClass {


[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]