Re: [evolution-patches] patch to work around broken encoded-words that cannot be parsed as rfc822 atom tokens

From: Not Zed <notzed ximian com>
To: Jeffrey Stedfast <fejj novell com>
Cc: evolution-patches ximian com
Subject: Re: [evolution-patches] patch to work around broken encoded-words that cannot be parsed as rfc822 atom tokens
Date: Tue, 01 Mar 2005 09:26:22 +0800

This will break the address test data in tests/lib/address-data.h, or it should.

Did you do a make check from camel/?

Otherwise i guess it looks "ok". I think you should just use two tables for the test data, not have a ctext bit for decode ctext, and have them separate test cases (i.e. separate loops with separate test_start's). You should also use test_free() not g_free() in all test case code.

Rolled on another technical decision, I really couldn't give a flying whats-it anymore.

On Mon, 2005-02-28 at 16:26 -0500, Jeffrey Stedfast wrote:

new patch attached

I've unified the append_*() usage, switched to using memchr() for places
that could benefit from it, and added a test-suite

fwiw, I checked your example against mozilla-mail's implementation and
it results in the same garbage as evolution (with my patch)... but yea,
I don't know what to do about that. clearly your example should not be
decoded, but users want the following to work too:

Subject: =?ISO-8859-1?B?SWYgeW91IGNh
	biByZWFkIHRoaXMgeW8=?=

so it seems we either follow the spec like we do now and get everything
right (as defined by the way things are supposed to work according to
the rfc) or we apply my patch and get some stuff wrong but make users
quit their belly aching :)

I'm OK with either - e.g. I'm not going to take it personal if this
patch doesn't ever make it in. perhaps this patch could be put someplace
for people to apply if they really feel strongly about communicating
with their friend's broken mailers and leave it at that.

I really don't like the idea of not handling valid cases in favour of
handling broken ones

Jeff

On Mon, 2005-02-28 at 12:22 +0800, Not Zed wrote:
> 
> So, how many valid mails does this break?
> 
> e.g.
> 
> =?iso-8859-1?b?foo this is a hidden message not for evolution users
> bar?=
> 
> 
> And why did you copy the append_ functions?  There should be only one
> of each in the code.  This is a messy enough hack as it is.
> 
> Instead of a test folder, any test data must be added to the
> regression test we already have for rfc2047 decoding.
> 
> BTW, using if (p = memchr(inptr, '=', end-start-2) && p[1]  == '?') is
> highly likely to be much more efficient than (while (inptr < inend-2
> && !strcmp("=?")) inptr++))
> 
> On Fri, 2005-02-25 at 13:47 -0500, Jeffrey Stedfast wrote: 
> > unfortunately, there exist a bountiful number of shitful mailers out
> > there who's authors clearly couldn't be bothered to read or understand
> > the MIME specifications and so just pulled an encoding scheme out of
> > their proverbials.
> > 
> > the attached patch tries to deal with the scenarios that I'm aware of,
> > namely illegal characters in the encoded text portion of the
> > encoded-word token (which, sadly, even includes SPACE and TAB)
> > 
> > for the convenience of anyone reading this message who hasn't yet read
> > rfc2047, here's a good quote from the end of section 2:
> > 
> >    IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
> >    by an RFC 822 parser.  As a consequence, unencoded white space
> >    characters (such as SPACE and HTAB) are FORBIDDEN within an
> >    'encoded-word'.  For example, the character sequence
> > 
> >       =?iso-8859-1?q?this is some text?=
> > 
> >    would be parsed as four 'atom's, rather than as a single 'atom' (by
> >    an RFC 822 parser) or 'encoded-word' (by a parser which understands
> >    'encoded-words').  The correct way to encode the string "this is some
> >    text" is to encode the SPACE characters as well, e.g.
> > 
> >       =?iso-8859-1?q?this=20is=20some=20text?=
> > 
> > so yes, the behaviour of the broken mailers is explicitly FORBIDDEN but
> > that hasn't stopped them. oh well.
> > 
> > 
> > I've also attached a test mbox for everyone's convenience in testing
> > this patch (feel free to add to it)
> > 
> > With the patch applied, both Mozilla-Mail and Evolution render the
> > subjects (and other headers) exactly the same afaict.
> >

text/plain attachment (broken-rfc2047.patch), ""

? broken-rfc2047.patch
? camel-mime-tables.c
? providers/imap4/imap4-XGWMOVE.patch
? providers/imap4/imap4.patch
? tests/message/test4
? tests/misc/rfc2047
? tests/misc/test2
? tests/misc/url-scan
Index: ChangeLog
===================================================================
RCS file: /cvs/gnome/evolution-data-server/camel/ChangeLog,v
retrieving revision 1.2432
diff -u -p -r1.2432 ChangeLog
--- ChangeLog	25 Feb 2005 03:49:26 -0000	1.2432
+++ ChangeLog	28 Feb 2005 21:17:52 -0000
@@ -1,3 +1,13 @@
+2005-02-25  Jeffrey Stedfast  <fejj novell com>
+
+	* camel-mime-utils.c (quoted_decode): Allow spaces in the text we
+	are decoding.
+	(append_quoted_pair): Changed to take charset params and convert
+	un-quoted-pair'd strings to UTF-8.
+	(header_decode_text): Rewritten to work around broken rfc2047
+	encoded-words sent by mailers who's authors couldn't be bothered
+	to read the specs.
+
 2005-02-24  Not Zed  <NotZed Ximian com>
 
 	** See bug #68459
Index: camel-mime-utils.c
===================================================================
RCS file: /cvs/gnome/evolution-data-server/camel/camel-mime-utils.c,v
retrieving revision 1.223
diff -u -p -r1.223 camel-mime-utils.c
--- camel-mime-utils.c	31 Jan 2005 06:56:28 -0000	1.223
+++ camel-mime-utils.c	28 Feb 2005 21:17:52 -0000
@@ -814,8 +814,12 @@ quoted_decode(const unsigned char *in, s
 			*outptr++ = 0x20;
 		} else if (c==' ' || c==0x09) {
 			/* FIXME: this is an error! ignore for now ... */
+#if ADHERE_TO_SPEC
 			ret = -1;
 			break;
+#else
+			*outptr++ = c;
+#endif
 		} else {
 			*outptr++ = c;
 		}
@@ -915,7 +919,7 @@ rfc2047_decode_word(const char *in, size
 
 	/* quick check to see if this could possibly be a real encoded word */
 	if (len < 8 || !(in[0] == '=' && in[1] == '?' && in[len-1] == '=' && in[len-2] == '?')) {
-		d(printf("invalid\n"));
+		d(printf("rfc2047_decode_word: invalid token\n"));
 		return NULL;
 	}
 	
@@ -1058,31 +1062,51 @@ append_8bit (GString *out, const char *i
 	
 }
 
-static GString *
-append_quoted_pair (GString *str, const char *in, gssize inlen)
+static void
+append_atom (GString *str, const char *in, ssize_t inlen, const char *default_charset, const char *locale_charset)
+{
+	g_string_append_len (str, in, inlen);
+}
+
+static void
+append_text (GString *str, const char *in, ssize_t inlen, const char *default_charset, const char *locale_charset)
+{
+	if ((default_charset == NULL || !append_8bit (str, in, inlen, default_charset))
+	    && (locale_charset == NULL || !append_8bit (str, in, inlen, locale_charset)))
+		append_latin1 (str, in, inlen);
+}
+
+static void
+append_quoted_pair (GString *str, const char *in, ssize_t inlen, const char *default_charset, const char *locale_charset)
 {
 	register const char *inptr = in;
 	const char *inend = in + inlen;
+	GString *unquoted;
 	char c;
 	
+	unquoted = g_string_new ("");
+	
 	while (inptr < inend) {
 		c = *inptr++;
 		if (c == '\\' && inptr < inend)
-			g_string_append_c (str, *inptr++);
+			g_string_append_c (unquoted, *inptr++);
 		else
-			g_string_append_c (str, c);
+			g_string_append_c (unquoted, c);
 	}
-
-	return str;
+	
+	append_text (str, unquoted->str, unquoted->len, default_charset, locale_charset);
+	g_string_free (unquoted, TRUE);
 }
 
+#ifdef ADHERE_TO_SPEC
+
 /* decodes a simple text, rfc822 + rfc2047 */
 static char *
 header_decode_text (const char *in, size_t inlen, int ctext, const char *default_charset)
 {
 	GString *out;
 	const char *inptr, *inend, *start, *chunk, *locale_charset;
-	GString *(* append) (GString *, const char *, gssize);
+	void (* append) (GString *, const char *, ssize_t, const char * const char *);
 	char *dword = NULL;
 	guint32 mask;
 	
@@ -1093,7 +1117,7 @@ header_decode_text (const char *in, size
 		append = append_quoted_pair;
 	} else {
 		mask = (CAMEL_MIME_IS_LWSP);
-		append = g_string_append_len;
+		append = append_atom;
 	}
 	
 	out = g_string_new ("");
@@ -1107,10 +1131,10 @@ header_decode_text (const char *in, size
 			inptr++;
 
 		if (inptr == inend) {
-			append (out, start, inptr - start);
+			append (out, start, inptr - start, default_charset, locale_charset);
 			break;
 		} else if (dword == NULL) {
-			append (out, start, inptr - start);
+			append (out, start, inptr - start, default_charset, locale_charset);
 		} else {
 			chunk = start;
 		}
@@ -1127,9 +1151,7 @@ header_decode_text (const char *in, size
 			if (!chunk)
 				chunk = start;
 			
-			if ((default_charset == NULL || !append_8bit (out, chunk, inptr-chunk, default_charset))
-			    && (locale_charset == NULL || !append_8bit(out, chunk, inptr-chunk, locale_charset)))
-				append_latin1(out, chunk, inptr-chunk);
+			append_text (out, chunk, inptr - chunk, default_charset, locale_charset);
 		}
 		
 		chunk = NULL;
@@ -1140,6 +1162,81 @@ header_decode_text (const char *in, size
 	
 	return dword;
 }
+
+#else /* ! ADHERE_TO_SPEC */
+
+static char *
+header_decode_text (const char *in, size_t inlen, int ctext, const char *default_charset)
+{
+	void (* append) (GString *, const char *, ssize_t, const char *, const char *);
+	const char *inptr, *inend, *start, *encword, *locale_charset;
+	char *dword = NULL;
+	GString *out;
+	
+	locale_charset = e_iconv_locale_charset ();
+	
+	if (ctext)
+		append = append_quoted_pair;
+	else
+		append = append_text;
+	
+	out = g_string_new ("");
+	inptr = in;
+	inend = inptr + inlen;
+	
+	while (inptr < inend) {
+		start = inptr;
+		
+		while (inptr < (inend - 8) && strncmp (inptr, "=?", 2) != 0) {
+			if (!camel_mime_is_lwsp (*inptr))
+				dword = NULL;
+			inptr++;
+		}
+		
+		if (inptr >= (inend - 8)) {
+			append (out, start, inend - start, default_charset, locale_charset);
+			break;
+		}
+		
+		/* could be an encoded word (or a broken encoded word which is why this code is so damn hairy) */
+		encword = inptr;
+		
+		inptr += 2;
+		inptr = memchr (inptr, '?', (inend - 4) - inptr);
+		if (inptr && (inptr[1] == 'B' || inptr[1] == 'b' || inptr[1] == 'Q' || inptr[1] == 'q') && inptr[2] == '?') {
+			/* looking more and more like an encoded word... */
+			inptr += 3;
+			inptr = memchr (inptr, '?', (inend - 1) - inptr);
+			if (!(inptr && inptr[1] == '='))
+				goto not_encword;
+			
+			if (!dword)
+				append (out, start, encword - start, default_charset, locale_charset);
+			
+			inptr += 2;
+			
+			if ((dword = rfc2047_decode_word (encword, inptr - encword))) {
+				g_string_append (out, dword);
+				g_free (dword);
+			} else {
+				append (out, encword, inptr - encword, default_charset, locale_charset);
+			}
+		} else {
+			/* not an encoded word */
+		not_encword:
+			dword = NULL;
+			inptr = encword + 2;
+			
+			append (out, start, inptr - start, default_charset, locale_charset);
+		}
+	}
+	
+	dword = out->str;
+	g_string_free (out, FALSE);
+	
+	return dword;
+}
+#endif /* ADHERE_TO_SPEC */
 
 char *
 camel_header_decode_string (const char *in, const char *default_charset)
Index: tests/misc/Makefile.am
===================================================================
RCS file: /cvs/gnome/evolution-data-server/camel/tests/misc/Makefile.am,v
retrieving revision 1.13
diff -u -p -r1.13 Makefile.am
--- tests/misc/Makefile.am	13 Dec 2004 03:08:52 -0000	1.13
+++ tests/misc/Makefile.am	28 Feb 2005 21:17:52 -0000
@@ -20,9 +20,10 @@ check_PROGRAMS =  	\
 	url-scan	\
 	utf7		\
 	split		\
+	rfc2047		\
 	test2
 
-TESTS = url utf7 split url-scan test2
+TESTS = url utf7 split url-scan rfc2047 test2
 


Index: tests/misc/rfc2047.c
===================================================================
RCS file: tests/misc/rfc2047.c
diff -N tests/misc/rfc2047.c
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ tests/misc/rfc2047.c	28 Feb 2005 21:17:52 -0000
@@ -0,0 +1,96 @@
+/* -*- Mode: C; tab-width: 8; indent-tabs-mode: t; c-basic-offset: 8 -*- */
+/*
+ *  Authors: Jeffrey Stedfast <fejj novell com>
+ *
+ *  Copyright 2005 Novell, Inc. (www.novell.com)
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Street #330, Boston, MA 02111-1307, USA.
+ *
+ */
+
+
+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
+#include <ctype.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <camel/camel-object.h>
+#include <camel/camel-mime-utils.h>
+
+#include "camel-test.h"
+
+struct {
+	const char *encoded;
+	const char *decoded;
+	int ctext;
+} tests[] = {
+	/* the first half are rfc compliant cases (which are the most important) */
+	{ "=?iso-8859-1?q?this=20is=20some=20text?=", "this is some text", 0 },
+	{ "this =?iso-8859-1?q?is_some?= text", "this is some text", 0 },
+	{ "=?iso-8859-1?q?th?= =?iso-8859-1?q?is?= is some text", "this is some text", 0 },
+	{ "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=  =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=",
+	  "If you can read this you understand the example.", 0 },
+	
+	/* second half: brokenly encoded rfc2047 words */
+	{ "foo=?UTF-8?Q?bar?=baz", "foobarbaz", 0 },
+	{ "foo =?UTF-8?Q?bar?=baz", "foo barbaz", 0 },
+	{ "foo=?UTF-8?Q?bar?= baz", "foobar baz", 0 },
+	{ "=?UTF-8?Q?foo bar?=baz", "foo barbaz", 0 },
+	{ "foo=?UTF-8?Q?bar baz?=", "foobar baz", 0 },
+	{ "=?UTF-8?Q?foo?==?UTF-8?Q?bar baz?=", "foobar baz", 0 },
+	{ "=?UTF-8?Q?foo?= =?UTF-8?Q?bar baz?=", "foobar baz", 0 },
+	{ "=?UTF-8?Q?foo?= bar =?UTF-8?Q?baz?=", "foo bar baz", 0 },
+	{ "=?foo=?UTF-8?Q?bar baz?=", "=?foobar baz", 0 },
+	{ "=?foo?=?UTF-8?Q?bar baz?=", "=?foo?bar baz", 0 },
+	{ "=?foo?Q=?UTF-8?Q?bar baz?=", "=?foo?Qbar baz", 0 },
+	{ "=?foo?Q?=?UTF-8?Q?bar baz?=", "=?foo?Q?bar baz", 0 },
+	{ "=?UTF-8?Q?bar ? baz?=", "=?UTF-8?Q?bar ? baz?=", 0 },
+	
+	/* ctext tests */
+	{ "Test of ctext (=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)", "Test of ctext (ab)", 1 },
+	{ "ctext with \\\"quoted\\\" pairs", "ctext with \"quoted\" pairs", 1 }
+};
+
+int main (int argc, char ** argv)
+{
+	char *decoded;
+	int i;
+	
+	camel_test_init (argc, argv);
+	
+	camel_test_start ("rfc2047 decoding");
+	
+	for (i = 0; i < G_N_ELEMENTS (tests); i++) {
+		camel_test_push ("rfc2047 decoding[%d] '%s'", i, tests[i].encoded);
+		
+		if (tests[i].ctext)
+			decoded = camel_header_format_ctext (tests[i].encoded, "iso-8859-1");
+		else
+			decoded = camel_header_decode_string (tests[i].encoded, "iso-8859-1");
+		
+		check (decoded && strcmp (decoded, tests[i].decoded) == 0);
+		
+		g_free (decoded);
+		
+		camel_test_pull ();
+	}
+	
+	camel_test_end();
+
+	return 0;
+}

[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]