Re: [Evolution-hackers] [patch] fixed incorrect rfc2047 decode for CJKheader

From: jacky <gtkdict yahoo com cn>
To: Jeff Stedfast <fejj novell com>, evolution-hackers gnome org
Subject: Re: [Evolution-hackers] [patch] fixed incorrect rfc2047 decode for CJKheader
Date: Mon, 24 Dec 2007 13:21:44 +0800 (CST)

--- Jeff Stedfast <fejj novell com> wrote:

> Hi Jacky,
> 
> I've looked over your patch, but unfortunately it is unusable. The
> patch is riddled with buffer overflows and incorrect logic.
> 

Yes, I used fixed-length strings to store some values, so they may
overflow. I also wrote another version that uses the heap instead of
the stack. I thought the stack version was simple and sufficient, so
I sent only that one. Both versions of rfc2047_decode_word() are in
the attachment.
Can you explain the incorrect logic in my patch?

> What types of bugs are you actually trying to fix? What is it about
> CJK messages in particular that are not getting decoded properly?
> Your email was overly vague.
> 

Maybe I used the wrong word. I think I just enhanced the CJK header
support. The patch improves three points:
1) Encoded-words must be separated by CRLF SPACE, but some email
clients do not do that.
2) The encoded bytes of a CJK character must stay within a single
encoded-word, but some email clients split them across two
encoded-words.
3) Some CJK characters can only be encoded in the GBK charset, but
the charset name in the encoded-word is GB2312.
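
Points 1 and 2 are why the new decoder joins the decoded bytes of
adjacent encoded-words before converting the charset. The sketch
below is only an illustration, not code from the patch: q_decode()
and utf8_complete() are hypothetical helpers, and UTF-8 stands in for
GBK so that no iconv call is needed. It shows a three-byte character
whose bytes are split across two encoded-words: each half is
incomplete on its own, but the joined buffer is valid.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexval(unsigned char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	c = (unsigned char) toupper(c);
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Hypothetical helper: decode the text part of a "Q" encoded-word.
 * out must have room for len bytes; returns the number of bytes written.
 * Malformed "=XX" escapes are copied through unchanged. */
static size_t q_decode(const char *text, size_t len, unsigned char *out)
{
	size_t i = 0, n = 0;

	assert(text != NULL && out != NULL);

	while (i < len) {
		if (text[i] == '=' && i + 2 < len
		    && hexval((unsigned char) text[i + 1]) >= 0
		    && hexval((unsigned char) text[i + 2]) >= 0) {
			out[n++] = (unsigned char) (hexval((unsigned char) text[i + 1]) * 16
						    + hexval((unsigned char) text[i + 2]));
			i += 3;
		} else {
			/* in the Q encoding '_' stands for a space */
			out[n++] = (text[i] == '_') ? ' ' : (unsigned char) text[i];
			i++;
		}
	}

	return n;
}

/* Hypothetical helper: returns 1 if buf holds only complete UTF-8
 * sequences (a lead byte followed by the right number of continuation
 * bytes). Overlong forms are not checked. */
static int utf8_complete(const unsigned char *buf, size_t len)
{
	size_t i = 0;

	while (i < len) {
		size_t need;

		if (buf[i] < 0x80)
			need = 0;
		else if ((buf[i] & 0xE0) == 0xC0)
			need = 1;
		else if ((buf[i] & 0xF0) == 0xE0)
			need = 2;
		else if ((buf[i] & 0xF8) == 0xF0)
			need = 3;
		else
			return 0;
		i++;

		while (need > 0) {
			if (i >= len || (buf[i] & 0xC0) != 0x80)
				return 0;
			i++;
			need--;
		}
	}

	return 1;
}
```

Take the text parts of =?UTF-8?Q?=E4=B8?= and =?UTF-8?Q?=AD?=.
Decoding "=E4=B8" alone gives two bytes that utf8_complete() rejects.
Decoding "=AD" into the same buffer right after them gives the
complete character U+4E2D.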

There are two kinds of email that need to be supported:
1) An encoded-word that was split across two lines. This was sent by
dotProject v2.0.1.
2) CJK characters encoded directly in GB2312. Evolution supports some
of these messages, but not all of them.
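
For point 3, one possible approach is to treat the GB2312 label as
GBK when choosing the converter, since GBK is a superset of GB2312. A
minimal sketch, assuming a hypothetical helper decode_charset_for()
(this is not the actual e-iconv change in the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>	/* strcasecmp (POSIX) */

/* Hypothetical helper: pick the charset name to hand to the converter
 * (for example iconv_open()) for a charset declared in an encoded-word.
 * Text that really is GB2312 decodes the same way under GBK, and
 * characters that exist only in GBK no longer fail to convert. */
static const char *decode_charset_for(const char *declared)
{
	assert(declared != NULL);

	if (strcasecmp(declared, "gb2312") == 0)
		return "GBK";

	return declared;
}
```

The mapping is applied only when decoding, so a header labelled
=?GB2312?B?...?= is converted with the GBK tables while every other
charset name is passed through unchanged.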

> Your changes to e-iconv can probably be taken if I understand
> correctly that GBK is a superset of gb2312
> ( http://en.wikipedia.org/wiki/GBK ), altho it would have been nice
> to have gotten some sort of link explaining that with your original
> email (or via a ChangeLog entry) :)
> 
> Thanks,
> 
> Jeff
> 
> >>> jacky <gtkdict yahoo com cn> 12/23/07 10:09 AM >>>
> Hi, all.
> 
> The rfc2047 decoder in libcamel cannot decode some CJK headers
> correctly. Although some of them do not conform to the RFC, I need
> to decode them correctly, and I thought that if Evolution could
> display these emails correctly, more people would like it.
> 
> So I wrote a new rfc2047 decoder, and it's in the patch. With the
> patch, libcamel can decode CJK headers correctly and Evolution can
> now display them correctly. I have tested it on my mailbox, which
> has 2000 emails sent by evolution, thunderbird, outlook, outlook
> express, foxmail, open webmail, yahoo, gmail, lotus notes, etc.
> Without this patch, almost 20% of these emails can't be decoded and
> displayed correctly; with this patch, 99% of them can.
> 
> I also found that attachments with CJK names can't be recognised
> and displayed by outlook / outlook express / foxmail. This is
> because these email clients do not support RFC2184. Evolution always
> uses the RFC2184 encoding method for attachment names, so an email
> with a CJK-named attachment can't be displayed in outlook / outlook
> express / foxmail. In thunderbird, you can set the option
> "mail.strictly_mime.parm_folding" to 0 or 1 to use the RFC2047
> encoding method for attachment names. Can we add a similar option?
> 
> Best regards.
> 




/* decode rfc 2047 encoded string segment */
#define DECWORD_LEN 1024
#define UTF8_DECWORD_LEN 2048

#if 1 //USE_STACK
static char *
rfc2047_decode_word(const char *in, size_t len)
{
	char prev_charset[32], curr_charset[32];
	char encode;
	char *start, *inptr, *inend;
	char decword[DECWORD_LEN], utf8_decword[UTF8_DECWORD_LEN];
	char *decword_ptr, *utf8_decword_ptr;
	size_t inlen, outlen, ret;

	prev_charset[0] = curr_charset[0] = '\0';

	decword_ptr = decword;
	utf8_decword_ptr = utf8_decword;

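	/* quick check to see if this could possibly be a real encoded word;
	 * decoded output is never longer than the encoded input, so rejecting
	 * inputs longer than DECWORD_LEN keeps decword from overflowing */
	if (len < 8 || len > DECWORD_LEN
	    || !(in[0] == '=' && in[1] == '?'
		 && in[len-1] == '=' && in[len-2] == '?')) {
		return NULL;
	}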

	inptr = in;
	inend = in + len;
	outlen = sizeof(utf8_decword) - 1;	/* reserve room for the trailing NUL */

	while (inptr < inend) {
		/* begin */
		inptr = memchr (inptr, '?', inend-inptr);
		if (!inptr || *(inptr-1) != '=') {
			return NULL;
		}
		inptr++;

		/* charset */
		start = inptr;
		inptr = memchr (inptr, '?', inend-inptr);
		if (!inptr) {
			return NULL;
		}
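		if ((size_t)(inptr - start) >= sizeof (curr_charset)) {
			/* charset name does not fit in the fixed-size buffer */
			return NULL;
		}
		memcpy (curr_charset, start, inptr-start);
		curr_charset[inptr-start] = '\0';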
		if (prev_charset[0] == '\0') { /* first charset in multi encode words */
			strcpy (prev_charset, curr_charset);
		}
		d(printf ("curr_charset = %s\n", curr_charset));

		/* if (charset.prev != charset.curr) iconv prev to utf8 */
		if (prev_charset[0] != '\0' && strcmp(prev_charset, curr_charset)) {
			inlen = decword_ptr - decword;
			ret = conv_to_utf8 (prev_charset, decword, inlen, utf8_decword_ptr, outlen);
			if (ret == (size_t)-1) {
				printf ("conv_to_utf8() error!\n");
				return NULL;
			}

			utf8_decword_ptr += ret;
			outlen = outlen - ret;

			decword_ptr = decword; /* reset decword_ptr */
			strcpy (prev_charset, curr_charset);
		}

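		/* encode: inptr is at the '?' that ends the charset, and the
		 * next two bytes ("X?") must still lie inside the input */
		if (inend - inptr < 3) {
			return NULL;
		}
		inptr++;
		encode = *inptr;
		inptr++;
		if (*inptr != '?') {
			return NULL;
		}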

		/* text */
		inptr++;
		start = inptr;
		inptr = memchr (inptr, '?', inend-inptr);
		if (!inptr || *(inptr+1) != '=') {
			return NULL;
		}

		/* decode */
		switch(encode) {
		case 'Q':
		case 'q':
			inlen = quoted_decode(start, inptr-start, decword_ptr);
			break;
		case 'B':
		case 'b':
			{
				int state = 0;
				unsigned int save = 0;

				inlen = camel_base64_decode_step(start, inptr-start, decword_ptr, &state, &save);
				/* if state != 0 then error? */
			}
			break;
		default:
			/* uhhh, unknown encoding type - probably an invalid encoded word string */
			return NULL;
		}

		if (inlen > 0) {
			decword_ptr += inlen;
		} else {
			return NULL;
		}

		inptr += 2;	/* skip '?=' */
	} /* end of "while (inptr < inend)" */

	/* at last, iconv to utf8 */
	inlen = decword_ptr - decword;
	ret = conv_to_utf8 (curr_charset, decword, inlen, utf8_decword_ptr, outlen);
	if (ret == (size_t)-1) {
		printf ("conv_to_utf8() error!\n");
		return NULL;
	}

	utf8_decword_ptr += ret;
	*utf8_decword_ptr = '\0';

	/* g_strdup so the result is GLib-allocated, like the heap version's */
	return g_strdup (utf8_decword);
}
#else  /* USE HEAP */
static char *
rfc2047_decode_word(const char *in, size_t len)
{
	char *prev_charset, *curr_charset;
	char encode;
	char *start, *inptr, *inend;
	char *decword, *decword_ptr;
	char *utf8_decword, *utf8_decword_ptr;
	size_t inlen, outlen, ret;

	prev_charset = curr_charset = NULL;

	decword = g_malloc (DECWORD_LEN);
	if (!decword) {
		return NULL;
	}
	decword_ptr = decword;

	utf8_decword = g_malloc (UTF8_DECWORD_LEN);
	if (!utf8_decword) {
		g_free (decword);
		return NULL;
	}
	utf8_decword_ptr = utf8_decword;

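	/* quick check to see if this could possibly be a real encoded word;
	 * decoded output is never longer than the encoded input, so rejecting
	 * inputs longer than DECWORD_LEN keeps decword from overflowing */
	if (len < 8 || len > DECWORD_LEN
	    || !(in[0] == '=' && in[1] == '?'
		 && in[len-1] == '=' && in[len-2] == '?')) {
		goto _error_return;
	}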

	inptr = in;
	inend = in + len;
	outlen = UTF8_DECWORD_LEN - 1;	/* reserve room for the trailing NUL */

	while (inptr < inend) {
		/* begin */
		inptr = memchr (inptr, '?', inend-inptr);
		if (!inptr || *(inptr-1) != '=') {
			goto _error_return;
		}
		inptr++;

		/* charset */
		start = inptr;
		inptr = memchr (inptr, '?', inend-inptr);
		if (!inptr) {
			goto _error_return;
		}
		if (curr_charset) {
			free (curr_charset);
		}
		curr_charset = strndup (start, inptr-start);
		d(printf ("curr_charset = %s\n", curr_charset));
		if (prev_charset == NULL) {
			prev_charset = strdup (curr_charset);
		}

		/* if (charset.prev != charset.curr) iconv prev to utf8 */
		if (prev_charset && strcmp(prev_charset, curr_charset)) {
			inlen = decword_ptr - decword;
			ret = conv_to_utf8 (prev_charset, decword, inlen, utf8_decword_ptr, outlen);
			if (ret == (size_t)-1) {
				printf ("conv_to_utf8() error!\n");
				/* or maybe we should grow 'utf8_decword' */
				goto _error_return;
			}

			utf8_decword_ptr += ret;
			outlen = outlen - ret;

			decword_ptr = decword; /* decword_ptr reset */
			if (prev_charset) {
				free (prev_charset);
			}
			prev_charset = strdup (curr_charset);
		}

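		/* encode: inptr is at the '?' that ends the charset, and the
		 * next two bytes ("X?") must still lie inside the input */
		if (inend - inptr < 3) {
			goto _error_return;
		}
		inptr++;
		encode = *inptr;
		inptr++;
		if (*inptr != '?') {
			goto _error_return;
		}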

		/* text */
		inptr++;
		start = inptr;
		inptr = memchr (inptr, '?', inend-inptr);
		if (!inptr || *(inptr+1) != '=') {
			goto _error_return;
		}

		/* decode */
		switch(encode) {
		case 'Q':
		case 'q':
			inlen = quoted_decode(start, inptr-start, decword_ptr);
			break;
		case 'B':
		case 'b':
			{
				int state = 0;
				unsigned int save = 0;

				inlen = camel_base64_decode_step(start, inptr-start, decword_ptr, &state, &save);
				/* if state != 0 then error? */
			}
			break;
		default:
			/* uhhh, unknown encoding type - probably an invalid encoded word string */
			goto _error_return;
		}

		if (inlen > 0) {
			decword_ptr += inlen;
		} else {
			/* or maybe we should grow 'decword' */
			goto _error_return;
		}

		inptr += 2;	/* skip "?=" */
	} /* end of "while (inptr < inend)" */

	/* at last, iconv to utf8 */
	inlen = decword_ptr - decword;
	ret = conv_to_utf8 (curr_charset, decword, inlen, utf8_decword_ptr, outlen);
	if (ret == (size_t)-1) {
		printf ("conv_to_utf8() error!\n");
		/* or maybe we should grow 'utf8_decword' */
		goto _error_return;
	}

	utf8_decword_ptr += ret;
	*utf8_decword_ptr = '\0';

	g_free (decword);
	if (prev_charset) {
		free (prev_charset);
	}
	if (curr_charset) {
		free (curr_charset);
	}

	return utf8_decword;

 _error_return:
	g_free (decword);
	g_free (utf8_decword);
	if (prev_charset) {
		free (prev_charset);
	}
	if (curr_charset) {
		free (curr_charset);
	}
  
	return NULL;
}
#endif

Follow-Ups:
- Re: [Evolution-hackers] [patch] fixed incorrect rfc2047 decode for CJKheader
  - From: Peter Volkov
- Re: [Evolution-hackers] [patch] fixed incorrect rfc2047 decode for CJKheader
  - From: Philip Van Hoof

References:
- Re: [Evolution-hackers] [patch] fixed incorrect rfc2047 decode for CJKheader
  - From: Jeff Stedfast
