[Evolution-hackers] Improved vCard parser



I'd like to focus on the vCard parser and export in libebook. Some
things I've noticed:
- v3.0 import is quite well supported, with the major exception of
charsets http://bugzilla.gnome.org/240756 tracks all the v3.0 bugs
- v2.1 import was well supported, but is gradually getting worse since
there is a common codebase for 2.1 and 3.0, and most people care about
3.0
- v3.0 export is good, with minor exceptions (CRLF at end of card for example)
- v2.1 export is non-existant
- performance is important, as vCards are used in the file-backend.
From a performance perspective, that's horrible, but it has its
advantages too. For large vCards, the poor performance easily kneeled
the system, for example with a medium sized photo. With
http://bugzilla.gnome.org/433782 it got much better, but there's still
a lot of potential for improvement

I've created a patch (against svn trunk) that improves the performance
of the parsing itself (only v3.0) and adds some other fixes like CRLF
at the end of the card. The patch is supposed to be non-intrusive, and
will not break public APIs, but mainly create new internal methods
that will only kick in for vCards with VERSION:3.0 in the second line.
Other vCards will be parsed as before.

After the patch, I created a test suite (attached archive) to test my
own patch and the current implementation. I used a different approach
than Ross in eds-dbus
(http://svn.o-hand.com/view/eds-dbus/trunk/addressbook/testsuite/).
Instead of creating classic hand-coded unit tests, I compare a parsed
file with a file that has the expected format. That way, new tests can
be added with much less effort, without writing any code. The downside
is that not all aspects of parsing can be tested. For example, if a
list separated with comma was read as one chunk, it probably wouldn't
detect that it should have been separated at the commas. Anyway, I
think the test suite makes sense and can be supplemented by classic
unit tests. Try it out and add more tests!

To run the v3.0 tests, for example, add the .vcf's in vcard/valid-3.0
as parameters to src/test-vcard-suite. -e outputs detailed error
messages and -r 100 will repeat the parsing 100 times for
benchmarking. A typical command I use is:
LD_LIBRARY_PATH=/opt/evolution-data-server/lib -e src/test-vcard-suite
vcard/**/*.vcf

A major weakness in the vCard 3.0 specification is its inability to
tag vCards in files with charset. The only ideal solution, as I see it
is to ask the user which charset he wants to use and maybe also
display a preview. For vCards in emails (anything MIME), the charset
can already be specified. What the parser needs is support for
converting from the specified charset. I added an extension of the
patch that does this too. It breaks the API and will need extensions
to the UI to be of any practical value.

For now, focus on the patch without charset support. Ross' patch at
http://bugzilla.gnome.org/312581 is related, and efforts should
ideally be joined. What we need is a strategy (Ross and Srini) of
where to end up. I see three possible roads:
- An optimised v3.0 parser with fallback to a "quirk-mode" parser for
v2.1 and buggy v3.0 (my patch goes down this road)
- One v2.1 and one v3.0
- One parser to rule them all, but has to be very, very clever to
maintain high performance and at the same time support
quoted-printable (basically what we have now, minus some performance)

Independent of the choice of strategy, there are a couple of obvious
spots to improve.
* Export is quite streamlined, but the method doing escaping can be improved
* Whenever e-d-s requires glib 2.12 anyway, maybe glib's base64 can be
used for improved performance(?) and reduced code
complexity/maintainability?

That was a lot, but the gist is "everyone, give some attention to the
vCard parser - improve it, test it or add test cases for whatever
doesn't work for you"!

Cheers,
Øystein
Index: e-vcard.c
===================================================================
--- e-vcard.c	(revisjon 7787)
+++ e-vcard.c	(arbeidskopi)
@@ -139,12 +139,12 @@
  *  newline characters if found, taking care of equal characters
  *  and other strange things.
  */
-static char*
-skip_newline (char *str, gboolean quoted_printable)
+static const char*
+skip_newline (const char *str, gboolean quoted_printable)
 {
-	char *p; 
-	char *next;
-	char *next2;
+	const char *p; 
+	const char *next;
+	const char *next2;
 	p = str;
 
 	/* -- swallow equal signs at end of line for quoted printable */
@@ -187,34 +187,29 @@
 
 /* skip forward until we hit the CRLF, or \0 */
 static void
-skip_to_next_line (char **p)
+skip_to_next_line (const char **p)
 {
-	char *lp;
-	lp = *p;
+	while (**p != '\n' && **p != '\r' && **p != '\0')
+		*p = g_utf8_next_char (*p);
 
-	while (*lp != '\n' && *lp != '\r' && *lp != '\0')
-		lp = g_utf8_next_char (lp);
-
 	/* -- skip over the endline */
-	while( *lp == '\r' || *lp == '\n' ) {
-		lp = g_utf8_next_char (lp);
+	while( **p == '\r' || **p == '\n' ) {
+		*p = g_utf8_next_char (*p);
 	}
-
-	*p = lp;
 }
 
 /* skip forward until we hit a character in @s, CRLF, or \0.  leave *p
    pointing at the character that causes us to stop */
 static void
-skip_until (char **p, char *s)
+skip_until (const char **p, const char *s)
 {
-	char *lp;
+	const char *lp;
 
 	lp = *p;
 
 	while (*lp != '\r' && *lp != '\0') {
 		gboolean s_matches = FALSE;
-		char *ls;
+		const char *ls;
 		for (ls = s; *ls; ls = g_utf8_next_char (ls)) {
 			if (g_utf8_get_char (ls) == g_utf8_get_char (lp)) {
 				s_matches = TRUE;
@@ -231,9 +226,9 @@
 }
 
 static void
-read_attribute_value (EVCardAttribute *attr, char **p, gboolean quoted_printable)
+read_attribute_value (EVCardAttribute *attr, const char **p, gboolean quoted_printable)
 {
-	char *lp = *p;
+	const char *lp = *p;
 	GString *str;
 
 	/* read in the value */
@@ -273,6 +268,7 @@
 				g_string_append_c (str, '\\');
 				break;
 			}
+			lp = skip_newline(lp, quoted_printable);
 			switch (*lp) {
 			case 'n': g_string_append_c (str, '\n'); break;
 			case 'N': g_string_append_c (str, '\n'); break;
@@ -310,10 +306,44 @@
 	*p = lp;
 }
 
+/* read the attribute value in p and add the value to attr */
+/* will read the rest of the line and leave p to point at the */
+/* end of line character */
 static void
-read_attribute_params (EVCardAttribute *attr, char **p, gboolean *quoted_printable)
+read_attribute_values (EVCardAttribute *attr, const gchar **p)
 {
-	char *lp;
+	GString *str = g_string_sized_new (strlen (*p));
+
+	while (**p != '\0') {
+		if (**p == ';' || **p == ',') {
+			attr->values = g_list_prepend (attr->values, g_string_free (str, FALSE));
+			str = g_string_sized_new (strlen (*p));
+		} else if (**p == '\\') {
+			*p = g_utf8_next_char (*p);
+			switch (**p) {
+			case 'n': g_string_insert_c (str, -1, '\n'); break;
+			case 'N': g_string_insert_c (str, -1, '\n'); break;
+			case ';': g_string_insert_c (str, -1, ';'); break;
+			case ',': g_string_insert_c (str, -1, ','); break;
+			case '\\': g_string_insert_c (str, -1, '\\'); break;
+			default:
+				g_warning ("invalid escape, passing it through");
+				str = g_string_insert_c (str, -1, '\\');
+				str = g_string_insert_unichar (str, -1, g_utf8_get_char(*p));
+				break;
+			}
+		} else
+			str = g_string_insert_unichar (str, -1, g_utf8_get_char (*p));
+		*p = g_utf8_next_char (*p);
+	}
+	attr->values = g_list_prepend (attr->values, g_string_free (str, FALSE));
+	attr->values = g_list_reverse (attr->values);
+}
+ 
+static void
+read_attribute_params (EVCardAttribute *attr, const char **p, gboolean *quoted_printable)
+{
+	const char *lp;
 	GString *str;
 	EVCardAttributeParam *param = NULL;
 	gboolean in_quote = FALSE;
@@ -457,18 +487,78 @@
 		g_string_free (str, TRUE);
 
 	*p = lp;
+
+	attr->params = g_list_reverse (attr->params);
 }
 
+/* Read one parameter exclusive the prepending semicolon */
+static void
+read_attribute_param (EVCardAttribute *attr, const gchar **p)
+{
+	EVCardAttributeParam *param = e_vcard_attribute_param_new (NULL);
+	const gchar *start = *p;
+
+	/* Parameter name */
+	while (**p != '\0') {
+		/* Append valid characters */
+		if (g_ascii_isalnum (**p) || **p == '-')
+			(*p)++;
+		/* End of parameter */
+		else if (**p == '=' || **p == ':' || **p == ';')
+			break;
+		else {
+			g_warning ("invalid character '%c' found in attribute parameter name", **p);
+			goto recover;
+		}
+	}
+	param->name = g_strndup (start, *p - start);
+
+	if (**p == '=') {
+		(*p)++;
+		do {
+			/* quoted-string */
+			if (**p == '"') {
+				(*p)++;
+				start = *p;
+				while (**p != '"') {
+					if (**p == '\0') {
+						g_warning ("invalid character NUL found in attribute parameter value");
+						goto recover;
+					}
+					*p = g_utf8_next_char (*p);
+				}
+				param->values = g_list_prepend (param->values, g_strndup (start, *p - start));
+				(*p)++;
+			}
+			/* ptext */
+			else {
+				start = *p;
+				while (**p != ';' && **p != ':' && **p != ',' && **p != '\0')
+					*p = g_utf8_next_char (*p);
+				param->values = g_list_prepend (param->values, g_strndup (start, *p - start));
+			}
+		} while (**p == ',' && (*p)++ != '\0');
+
+		param->values = g_list_reverse (param->values);
+		e_vcard_attribute_add_param (attr, param);
+		return;
+	}
+recover:
+	while (**p != ';' && **p != ':' && **p != '\0')
+		*p = g_utf8_next_char (*p);
+	e_vcard_attribute_param_free (param);
+}
+
 /* reads an entire attribute from the input buffer, leaving p pointing
    at the start of the next line (past the \r\n) */
 static EVCardAttribute*
-read_attribute (char **p)
+read_attribute (const char **p)
 {
 	char *attr_group = NULL;
 	char *attr_name = NULL;
 	EVCardAttribute *attr = NULL;
 	GString *str;
-	char *lp;
+	const char *lp;
 	gboolean is_qp = FALSE;
 
 	/* first read in the group/name */
@@ -557,42 +647,99 @@
 	return NULL;
 }
 
-/* Stolen from glib/glib/gconvert.c */
+/* Unfold one or more physical lines into one logical line */
+/* as described by RFC 2425 */
+/* Point p to the beginning of the next logical line and return */
+/* the unfolded line as a newly allocated string that must be freed */
 static gchar *
-make_valid_utf8 (const gchar *name)
+unfold_line (const gchar **p)
 {
-	GString *string;
-	const gchar *remainder, *invalid;
-	gint remaining_bytes, valid_bytes;
-  
-	string = NULL;
-  	remainder = name;
-	remaining_bytes = strlen (name);
-  
- 	while (remaining_bytes != 0) {
-		if (g_utf8_validate (remainder, remaining_bytes, &invalid)) 
+	const gchar *start;
+	GString *str = g_string_sized_new (16);
+
+	while (TRUE) {
+		start = *p;
+		while (**p != '\r' && **p != '\n' && **p != '\0')
+			*p = g_utf8_next_char (*p);
+		g_string_insert_len (str, -1, start, *p - start);
+		if (**p == '\r' || **p == '\n') {
+			*p = g_utf8_next_char (*p);
+			if (**p == '\r' || **p == '\n')
+				*p = g_utf8_next_char (*p);
+			if (**p == ' ' || **p == '\t') {
+				*p = g_utf8_next_char (*p);
+				continue;
+			}
+		}
+		break;
+	}
+
+	return g_string_free (str, FALSE);
+}
+
+static EVCardAttribute*
+read_attribute_optimised (const gchar **p)
+{
+	EVCardAttribute *attr = e_vcard_attribute_new (NULL, NULL);
+	const gchar *start;
+	gchar *lp, *lp_start;
+
+	/* Unfold line */
+	lp = lp_start = unfold_line (p);
+
+	start = lp;
+	/* Capture name (and group) */
+	while (*lp != '\0') {
+		/* Valid character */
+		if (g_ascii_isalnum (*lp) || *lp == '-')
+			;
+		/* End of name */
+		else if (*lp == ':' || *lp == ';')
+			/* we've got a name, break out to the value/attribute parsing */
 			break;
-	      valid_bytes = invalid - remainder;
-    
-		if (string == NULL) 
-			string = g_string_sized_new (remaining_bytes);
+		/* End of group */
+		else if (*lp == '.') {
+			if (attr->group == NULL)
+				attr->group = g_strndup (start, lp - start);
+			else
+				g_warning ("extra `.' in attribute specification.  ignoring extra group");
+			start = ++lp;
+			continue;
+		}
+		else {
+			g_warning ("invalid character '%c' found in attribute group/name", *lp);
+			goto recover;
+		}
 
-		g_string_append_len (string, remainder, valid_bytes);
-	        /* append U+FFFD REPLACEMENT CHARACTER */
-	        g_string_append (string, "\357\277\275");
-      
-	        remaining_bytes -= valid_bytes + 1;
-	        remainder = invalid + 1;
+		lp++;
 	}
-  
-  	if (string == NULL)
-    		return g_strdup (name);
-  
-	g_string_append (string, remainder);
 
-        g_assert (g_utf8_validate (string->str, -1, NULL));
-  
-        return g_string_free (string, FALSE);
+	if ((lp - start) == 0) {
+		/* since we don't have an attribute name, ignore this line */
+		g_warning ("ignoring atribute with empty name");
+		goto recover;
+	}
+
+	attr->name = g_strndup (start, lp - start);
+
+	/* Attribute parameters */
+	while (*lp == ';') {
+		lp++;
+		read_attribute_param (attr, (const gchar **) &lp);
+	}
+	attr->params = g_list_reverse (attr->params);
+
+	/* Values */
+	if (*lp == ':') {
+		lp++;
+		read_attribute_values (attr, (const gchar **) &lp);
+		g_free (lp_start);
+		return attr;
+	}
+ recover:
+	g_free (lp_start);
+	e_vcard_attribute_free (attr);
+	return NULL;
 }
 
 /* we try to be as forgiving as we possibly can here - this isn't a
@@ -602,22 +749,9 @@
 static void
 parse (EVCard *evc, const char *str)
 {
-	char *buf ;
-	char *p;
 	EVCardAttribute *attr;
 	
-	buf = make_valid_utf8 (str);
-
-	//buf = fold_lines (buf);
-
-	d(printf ("BEFORE FOLDING:\n"));
-	d(printf (str));
-	d(printf ("\n\nAFTER FOLDING:\n"));
-	d(printf (buf));
-
-	p = buf;
-
-	attr = read_attribute (&p);
+	attr = read_attribute (&str);
 	if (!attr || attr->group || g_ascii_strcasecmp (attr->name, "begin")) {
 		g_warning ("vcard began without a BEGIN:VCARD\n");
 	}
@@ -626,25 +760,29 @@
 	else if (attr)
 		e_vcard_add_attribute (evc, attr);
 
-	while (*p) {
-		EVCardAttribute *next_attr = read_attribute (&p);
-
-		if (next_attr) {
-			if (g_ascii_strcasecmp (next_attr->name, "end"))
-				e_vcard_add_attribute (evc, next_attr);
-			attr = next_attr;
+	/* Peek at VERSION attribute to identify v3 for optimised parsing */
+	if (g_ascii_strncasecmp (str, "VERSION:3.0", 11) == 0) {
+		while ((attr = read_attribute_optimised (&str))) {
+			if (g_ascii_strcasecmp (attr->name, "end") != 0)
+				e_vcard_add_attribute (evc, attr);
+			else
+				break;
 		}
+	} else {
+		while ((attr = read_attribute (&str))) {
+			if (g_ascii_strcasecmp (attr->name, "end") != 0)
+				e_vcard_add_attribute (evc, attr);
+			else
+				break;
+		}
 	}
 
-	if (!attr || attr->group || g_ascii_strcasecmp (attr->name, "end")) {
+	if (!attr || attr->group || g_ascii_strcasecmp (attr->name, "end"))
 		g_warning ("vcard ended without END:VCARD\n");
-	}
 
 	if (attr && !g_ascii_strcasecmp (attr->name, "end"))
 		e_vcard_attribute_free (attr);
 
-	g_free (buf);
-
 	evc->priv->attributes = g_list_reverse (evc->priv->attributes);
 }
 
@@ -796,14 +934,10 @@
 	GList *l;
 	GList *v;
 
-	GString *str = g_string_new ("");
-
-	g_string_append (str, "BEGIN:VCARD" CRLF);
-
 	/* we hardcode the version (since we're outputting to a
 	   specific version) and ignore any version attributes the
 	   vcard might contain */
-	g_string_append (str, "VERSION:3.0" CRLF);
+	GString *str = g_string_new ("BEGIN:VCARD" CRLF "VERSION:3.0" CRLF);
 
 	for (l = evc->priv->attributes; l; l = l->next) {
 		GList *list;
@@ -814,7 +948,7 @@
 		if (!g_ascii_strcasecmp (attr->name, "VERSION"))
 			continue;
 
-		attr_str = g_string_new ("");
+		attr_str = g_string_sized_new (16);
 
 		/* From rfc2425, 5.8.2
 		 *
@@ -862,12 +996,15 @@
 		g_string_append_c (attr_str, ':');
 
 		for (v = attr->values; v; v = v->next) {
-			char *value = v->data;
-			char *escaped_value = NULL;
+			if (attr->encoding_set && attr->encoding == EVC_ENCODING_BASE64)
+				g_string_append (attr_str, v->data);
+			else {
+				char *escaped_value = e_vcard_escape_string (v->data);
 
-			escaped_value = e_vcard_escape_string (value);
+				g_string_append (attr_str, escaped_value);
+				g_free (escaped_value);
+			}
 
-			g_string_append (attr_str, escaped_value);
 			if (v->next) {
 				/* XXX toshok - i hate you, rfc 2426.
 				   why doesn't CATEGORIES use a ; like
@@ -877,8 +1014,6 @@
 				else
 					g_string_append_c (attr_str, ';');
 			}
-
-			g_free (escaped_value);
 		}
 
 		/* 5.8.2:
@@ -908,7 +1043,7 @@
 		g_string_free (attr_str, TRUE);
 	}
 
-	g_string_append (str, "END:VCARD");
+	g_string_append (str, "END:VCARD" CRLF);
 
 	return g_string_free (str, FALSE);
 }
Index: e-vcard.c
===================================================================
--- e-vcard.c	(revisjon 7787)
+++ e-vcard.c	(arbeidskopi)
@@ -139,12 +139,12 @@
  *  newline characters if found, taking care of equal characters
  *  and other strange things.
  */
-static char*
-skip_newline (char *str, gboolean quoted_printable)
+static const char*
+skip_newline (const char *str, gboolean quoted_printable)
 {
-	char *p; 
-	char *next;
-	char *next2;
+	const char *p; 
+	const char *next;
+	const char *next2;
 	p = str;
 
 	/* -- swallow equal signs at end of line for quoted printable */
@@ -187,34 +187,29 @@
 
 /* skip forward until we hit the CRLF, or \0 */
 static void
-skip_to_next_line (char **p)
+skip_to_next_line (const char **p)
 {
-	char *lp;
-	lp = *p;
+	while (**p != '\n' && **p != '\r' && **p != '\0')
+		*p = g_utf8_next_char (*p);
 
-	while (*lp != '\n' && *lp != '\r' && *lp != '\0')
-		lp = g_utf8_next_char (lp);
-
 	/* -- skip over the endline */
-	while( *lp == '\r' || *lp == '\n' ) {
-		lp = g_utf8_next_char (lp);
+	while( **p == '\r' || **p == '\n' ) {
+		*p = g_utf8_next_char (*p);
 	}
-
-	*p = lp;
 }
 
 /* skip forward until we hit a character in @s, CRLF, or \0.  leave *p
    pointing at the character that causes us to stop */
 static void
-skip_until (char **p, char *s)
+skip_until (const char **p, const char *s)
 {
-	char *lp;
+	const char *lp;
 
 	lp = *p;
 
 	while (*lp != '\r' && *lp != '\0') {
 		gboolean s_matches = FALSE;
-		char *ls;
+		const char *ls;
 		for (ls = s; *ls; ls = g_utf8_next_char (ls)) {
 			if (g_utf8_get_char (ls) == g_utf8_get_char (lp)) {
 				s_matches = TRUE;
@@ -231,9 +226,9 @@
 }
 
 static void
-read_attribute_value (EVCardAttribute *attr, char **p, gboolean quoted_printable)
+read_attribute_value (EVCardAttribute *attr, const char **p, gboolean quoted_printable)
 {
-	char *lp = *p;
+	const char *lp = *p;
 	GString *str;
 
 	/* read in the value */
@@ -273,6 +268,7 @@
 				g_string_append_c (str, '\\');
 				break;
 			}
+			lp = skip_newline(lp, quoted_printable);
 			switch (*lp) {
 			case 'n': g_string_append_c (str, '\n'); break;
 			case 'N': g_string_append_c (str, '\n'); break;
@@ -310,10 +306,44 @@
 	*p = lp;
 }
 
+/* read the attribute value in p and add the value to attr */
+/* will read the rest of the line and leave p to point at the */
+/* end of line character */
 static void
-read_attribute_params (EVCardAttribute *attr, char **p, gboolean *quoted_printable)
+read_attribute_values (EVCardAttribute *attr, const gchar **p)
 {
-	char *lp;
+	GString *str = g_string_sized_new (strlen (*p));
+
+	while (**p != '\0') {
+		if (**p == ';' || **p == ',') {
+			attr->values = g_list_prepend (attr->values, g_string_free (str, FALSE));
+			str = g_string_sized_new (strlen (*p));
+		} else if (**p == '\\') {
+			*p = g_utf8_next_char (*p);
+			switch (**p) {
+			case 'n': g_string_insert_c (str, -1, '\n'); break;
+			case 'N': g_string_insert_c (str, -1, '\n'); break;
+			case ';': g_string_insert_c (str, -1, ';'); break;
+			case ',': g_string_insert_c (str, -1, ','); break;
+			case '\\': g_string_insert_c (str, -1, '\\'); break;
+			default:
+				g_warning ("invalid escape, passing it through");
+				str = g_string_insert_c (str, -1, '\\');
+				str = g_string_insert_unichar (str, -1, g_utf8_get_char(*p));
+				break;
+			}
+		} else
+			str = g_string_insert_unichar (str, -1, g_utf8_get_char (*p));
+		*p = g_utf8_next_char (*p);
+	}
+	attr->values = g_list_prepend (attr->values, g_string_free (str, FALSE));
+	attr->values = g_list_reverse (attr->values);
+}
+ 
+static void
+read_attribute_params (EVCardAttribute *attr, const char **p, gboolean *quoted_printable)
+{
+	const char *lp;
 	GString *str;
 	EVCardAttributeParam *param = NULL;
 	gboolean in_quote = FALSE;
@@ -457,18 +487,78 @@
 		g_string_free (str, TRUE);
 
 	*p = lp;
+
+	attr->params = g_list_reverse (attr->params);
 }
 
+/* Read one parameter exclusive the prepending semicolon */
+static void
+read_attribute_param (EVCardAttribute *attr, const gchar **p)
+{
+	EVCardAttributeParam *param = e_vcard_attribute_param_new (NULL);
+	const gchar *start = *p;
+
+	/* Parameter name */
+	while (**p != '\0') {
+		/* Append valid characters */
+		if (g_ascii_isalnum (**p) || **p == '-')
+			(*p)++;
+		/* End of parameter */
+		else if (**p == '=' || **p == ':' || **p == ';')
+			break;
+		else {
+			g_warning ("invalid character '%c' found in attribute parameter name", **p);
+			goto recover;
+		}
+	}
+	param->name = g_strndup (start, *p - start);
+
+	if (**p == '=') {
+		(*p)++;
+		do {
+			/* quoted-string */
+			if (**p == '"') {
+				(*p)++;
+				start = *p;
+				while (**p != '"') {
+					if (**p == '\0') {
+						g_warning ("invalid character NUL found in attribute parameter value");
+						goto recover;
+					}
+					*p = g_utf8_next_char (*p);
+				}
+				param->values = g_list_prepend (param->values, g_strndup (start, *p - start));
+				(*p)++;
+			}
+			/* ptext */
+			else {
+				start = *p;
+				while (**p != ';' && **p != ':' && **p != ',' && **p != '\0')
+					*p = g_utf8_next_char (*p);
+				param->values = g_list_prepend (param->values, g_strndup (start, *p - start));
+			}
+		} while (**p == ',' && (*p)++ != '\0');
+
+		param->values = g_list_reverse (param->values);
+		e_vcard_attribute_add_param (attr, param);
+		return;
+	}
+recover:
+	while (**p != ';' && **p != ':' && **p != '\0')
+		*p = g_utf8_next_char (*p);
+	e_vcard_attribute_param_free (param);
+}
+
 /* reads an entire attribute from the input buffer, leaving p pointing
    at the start of the next line (past the \r\n) */
 static EVCardAttribute*
-read_attribute (char **p)
+read_attribute (const char **p)
 {
 	char *attr_group = NULL;
 	char *attr_name = NULL;
 	EVCardAttribute *attr = NULL;
 	GString *str;
-	char *lp;
+	const char *lp;
 	gboolean is_qp = FALSE;
 
 	/* first read in the group/name */
@@ -557,42 +647,99 @@
 	return NULL;
 }
 
-/* Stolen from glib/glib/gconvert.c */
+/* Unfold one or more physical lines into one logical line */
+/* as described by RFC 2425 */
+/* Point p to the beginning of the next logical line and return */
+/* the unfolded line as a newly allocated string that must be freed */
 static gchar *
-make_valid_utf8 (const gchar *name)
+unfold_line (const gchar **p)
 {
-	GString *string;
-	const gchar *remainder, *invalid;
-	gint remaining_bytes, valid_bytes;
-  
-	string = NULL;
-  	remainder = name;
-	remaining_bytes = strlen (name);
-  
- 	while (remaining_bytes != 0) {
-		if (g_utf8_validate (remainder, remaining_bytes, &invalid)) 
+	const gchar *start;
+	GString *str = g_string_sized_new (16);
+
+	while (TRUE) {
+		start = *p;
+		while (**p != '\r' && **p != '\n' && **p != '\0')
+			*p = g_utf8_next_char (*p);
+		g_string_insert_len (str, -1, start, *p - start);
+		if (**p == '\r' || **p == '\n') {
+			*p = g_utf8_next_char (*p);
+			if (**p == '\r' || **p == '\n')
+				*p = g_utf8_next_char (*p);
+			if (**p == ' ' || **p == '\t') {
+				*p = g_utf8_next_char (*p);
+				continue;
+			}
+		}
+		break;
+	}
+
+	return g_string_free (str, FALSE);
+}
+
+static EVCardAttribute*
+read_attribute_optimised (const gchar **p)
+{
+	EVCardAttribute *attr = e_vcard_attribute_new (NULL, NULL);
+	const gchar *start;
+	gchar *lp, *lp_start;
+
+	/* Unfold line */
+	lp = lp_start = unfold_line (p);
+
+	start = lp;
+	/* Capture name (and group) */
+	while (*lp != '\0') {
+		/* Valid character */
+		if (g_ascii_isalnum (*lp) || *lp == '-')
+			;
+		/* End of name */
+		else if (*lp == ':' || *lp == ';')
+			/* we've got a name, break out to the value/attribute parsing */
 			break;
-	      valid_bytes = invalid - remainder;
-    
-		if (string == NULL) 
-			string = g_string_sized_new (remaining_bytes);
+		/* End of group */
+		else if (*lp == '.') {
+			if (attr->group == NULL)
+				attr->group = g_strndup (start, lp - start);
+			else
+				g_warning ("extra `.' in attribute specification.  ignoring extra group");
+			start = ++lp;
+			continue;
+		}
+		else {
+			g_warning ("invalid character '%c' found in attribute group/name", *lp);
+			goto recover;
+		}
 
-		g_string_append_len (string, remainder, valid_bytes);
-	        /* append U+FFFD REPLACEMENT CHARACTER */
-	        g_string_append (string, "\357\277\275");
-      
-	        remaining_bytes -= valid_bytes + 1;
-	        remainder = invalid + 1;
+		lp++;
 	}
-  
-  	if (string == NULL)
-    		return g_strdup (name);
-  
-	g_string_append (string, remainder);
 
-        g_assert (g_utf8_validate (string->str, -1, NULL));
-  
-        return g_string_free (string, FALSE);
+	if ((lp - start) == 0) {
+		/* since we don't have an attribute name, ignore this line */
+		g_warning ("ignoring atribute with empty name");
+		goto recover;
+	}
+
+	attr->name = g_strndup (start, lp - start);
+
+	/* Attribute parameters */
+	while (*lp == ';') {
+		lp++;
+		read_attribute_param (attr, (const gchar **) &lp);
+	}
+	attr->params = g_list_reverse (attr->params);
+
+	/* Values */
+	if (*lp == ':') {
+		lp++;
+		read_attribute_values (attr, (const gchar **) &lp);
+		g_free (lp_start);
+		return attr;
+	}
+ recover:
+	g_free (lp_start);
+	e_vcard_attribute_free (attr);
+	return NULL;
 }
 
 /* we try to be as forgiving as we possibly can here - this isn't a
@@ -600,24 +747,23 @@
  * try to return *something*.
  */
 static void
-parse (EVCard *evc, const char *str)
+parse (EVCard *evc, const char *src_str, const char *charset)
 {
-	char *buf ;
-	char *p;
 	EVCardAttribute *attr;
-	
-	buf = make_valid_utf8 (str);
+	const gchar *str = src_str;
 
-	//buf = fold_lines (buf);
+	if (charset != NULL) {
+		GError *err = NULL;
 
-	d(printf ("BEFORE FOLDING:\n"));
-	d(printf (str));
-	d(printf ("\n\nAFTER FOLDING:\n"));
-	d(printf (buf));
+		str = g_convert (src_str, -1, "UTF-8", charset, NULL, NULL, &err);
+		if (str == NULL) {
+			g_warning ("Unable to convert VCard from charset %s: '%s'"
+				   ", falling back to UTF-8", charset, err->message);
+			str = src_str;
+		}
+	}
 
-	p = buf;
-
-	attr = read_attribute (&p);
+	attr = read_attribute (&str);
 	if (!attr || attr->group || g_ascii_strcasecmp (attr->name, "begin")) {
 		g_warning ("vcard began without a BEGIN:VCARD\n");
 	}
@@ -626,25 +772,29 @@
 	else if (attr)
 		e_vcard_add_attribute (evc, attr);
 
-	while (*p) {
-		EVCardAttribute *next_attr = read_attribute (&p);
-
-		if (next_attr) {
-			if (g_ascii_strcasecmp (next_attr->name, "end"))
-				e_vcard_add_attribute (evc, next_attr);
-			attr = next_attr;
+	/* Peek at VERSION attribute to identify v3 for optimised parsing */
+	if (g_ascii_strncasecmp (str, "VERSION:3.0", 11) == 0) {
+		while ((attr = read_attribute_optimised (&str))) {
+			if (g_ascii_strcasecmp (attr->name, "end") != 0)
+				e_vcard_add_attribute (evc, attr);
+			else
+				break;
 		}
+	} else {
+		while ((attr = read_attribute (&str))) {
+			if (g_ascii_strcasecmp (attr->name, "end") != 0)
+				e_vcard_add_attribute (evc, attr);
+			else
+				break;
+		}
 	}
 
-	if (!attr || attr->group || g_ascii_strcasecmp (attr->name, "end")) {
+	if (!attr || attr->group || g_ascii_strcasecmp (attr->name, "end"))
 		g_warning ("vcard ended without END:VCARD\n");
-	}
 
 	if (attr && !g_ascii_strcasecmp (attr->name, "end"))
 		e_vcard_attribute_free (attr);
 
-	g_free (buf);
-
 	evc->priv->attributes = g_list_reverse (evc->priv->attributes);
 }
 
@@ -744,9 +894,19 @@
 	g_return_if_fail (str != NULL);
 
 	if (*str)
-		parse (evc, str);
+		parse (evc, str, NULL);
 }
 
+void
+e_vcard_construct_with_charset (EVCard *evc, const char *str, const char *charset)
+{
+	g_return_if_fail (E_IS_VCARD (evc));
+	g_return_if_fail (str != NULL);
+
+	if (*str)
+		parse (evc, str, charset);
+}
+
 /**
  * e_vcard_new:
  * 
@@ -783,6 +943,31 @@
 	return evc;
 }
 
+/**
+ * e_vcard_new_from_string_with_charset:
+ * @str: a string representation of the vcard to create
+ * @charset: name of str's charset
+ *
+ * Creates a new #EVCard from the passed-in string
+ * representation. The string will be converted from
+ * the specified charset.
+ *
+ * Return value: A new #EVCard.
+ **/
+EVCard *
+e_vcard_new_from_string_with_charset (const char *str, const char *charset)
+{
+	EVCard *evc;
+
+	g_return_val_if_fail (str != NULL, NULL);
+
+	evc = g_object_new (E_TYPE_VCARD, NULL);
+
+	e_vcard_construct_with_charset (evc, str, charset);
+
+	return evc;
+}
+
 static char*
 e_vcard_to_string_vcard_21  (EVCard *evc)
 {
@@ -796,14 +981,10 @@
 	GList *l;
 	GList *v;
 
-	GString *str = g_string_new ("");
-
-	g_string_append (str, "BEGIN:VCARD" CRLF);
-
 	/* we hardcode the version (since we're outputting to a
 	   specific version) and ignore any version attributes the
 	   vcard might contain */
-	g_string_append (str, "VERSION:3.0" CRLF);
+	GString *str = g_string_new ("BEGIN:VCARD" CRLF "VERSION:3.0" CRLF);
 
 	for (l = evc->priv->attributes; l; l = l->next) {
 		GList *list;
@@ -814,7 +995,7 @@
 		if (!g_ascii_strcasecmp (attr->name, "VERSION"))
 			continue;
 
-		attr_str = g_string_new ("");
+		attr_str = g_string_sized_new (16);
 
 		/* From rfc2425, 5.8.2
 		 *
@@ -862,12 +1043,15 @@
 		g_string_append_c (attr_str, ':');
 
 		for (v = attr->values; v; v = v->next) {
-			char *value = v->data;
-			char *escaped_value = NULL;
+			if (attr->encoding_set && attr->encoding == EVC_ENCODING_BASE64)
+				g_string_append (attr_str, v->data);
+			else {
+				char *escaped_value = e_vcard_escape_string (v->data);
 
-			escaped_value = e_vcard_escape_string (value);
+				g_string_append (attr_str, escaped_value);
+				g_free (escaped_value);
+			}
 
-			g_string_append (attr_str, escaped_value);
 			if (v->next) {
 				/* XXX toshok - i hate you, rfc 2426.
 				   why doesn't CATEGORIES use a ; like
@@ -877,8 +1061,6 @@
 				else
 					g_string_append_c (attr_str, ';');
 			}
-
-			g_free (escaped_value);
 		}
 
 		/* 5.8.2:
@@ -908,7 +1090,7 @@
 		g_string_free (attr_str, TRUE);
 	}
 
-	g_string_append (str, "END:VCARD");
+	g_string_append (str, "END:VCARD" CRLF);
 
 	return g_string_free (str, FALSE);
 }
Index: e-vcard.h
===================================================================
--- e-vcard.h	(revisjon 7787)
+++ e-vcard.h	(arbeidskopi)
@@ -127,8 +127,10 @@
 GType   e_vcard_get_type                     (void);
 
 void    e_vcard_construct                    (EVCard *evc, const char *str);
+void    e_vcard_construct_with_charset       (EVCard *evc, const char *str, const char *charset);
 EVCard* e_vcard_new                          (void);
 EVCard* e_vcard_new_from_string              (const char *str);
+EVCard* e_vcard_new_from_string_with_charset (const char *str, const char *charset);
 
 char*   e_vcard_to_string                    (EVCard *evc, EVCardFormat format);
 

Attachment: eds-vcard-test.tar.bz2
Description: BZip2 compressed data



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]