Re: [patch] be more liberal in accepting wrongly encoded nationalchars



On Sun, 2003-02-09 at 12:14, Albrecht Dreß wrote:
> On 09.02.03 14:42, Steffen Klemer wrote:
> > Why make Balsa even more complicated to understand and configure for
> > people who don't want to know anything about "charset", "encoding" and
> > "standards"?
> > Other mailers can handle such clutter as well!?
> > A possible solution for me would be to guess the default charset from
> > the locale (it was invented for things like that) and show a short text
> > telling the user that Balsa presumes it is an "...-Mail".
> 
> The locale is not the complete solution. If you use e.g. de_DE@iso-10646,
> you're lost. If the broken mail comes from a Microsnot system, chances are
> high that the content is windows-1252, which is almost (but not
> completely) 8859-1.

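Checking for windows-125x is extremely simple: bytes 128-159 are control
codes in iso-8859-x, so real text that contains them is almost certainly
one of the Windows code pages instead: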

#include <strings.h>	/* strcasecmp */
#include <glib.h>	/* GByteArray, gboolean, g_warning */

/* We don't really use the charset argument except for debugging... */
static gboolean
broken_windows_charset (GByteArray *buffer, const char *charset)
{
	register unsigned char *inptr;
	unsigned char *inend;
	
	inptr = buffer->data;
	inend = inptr + buffer->len;
	
	while (inptr < inend) {
		register unsigned char c = *inptr++;
		
		if (c >= 128 && c <= 159) {
			g_warning ("Encountered Windows charset parading as %s", charset);
			return TRUE;
		}
	}
	
	return FALSE;
}

const char *
iso_charset_to_windows (const char *isocharset)
{
	/* According to http://czyborra.com/charsets/codepages.html,
	 * the charset mapping is as follows:
	 *
	 * us-ascii    maps to windows-cp1252
	 * iso-8859-1  maps to windows-cp1252
	 * iso-8859-2  maps to windows-cp1250
	 * iso-8859-3  maps to windows-cp????
	 * iso-8859-4  maps to windows-cp????
	 * iso-8859-5  maps to windows-cp1251
	 * iso-8859-6  maps to windows-cp1256
	 * iso-8859-7  maps to windows-cp1253
	 * iso-8859-8  maps to windows-cp1255
	 * iso-8859-9  maps to windows-cp1254
	 * iso-8859-10 maps to windows-cp????
	 * iso-8859-11 maps to windows-cp????
	 * iso-8859-12 maps to windows-cp????
	 * iso-8859-13 maps to windows-cp1257
	 *
	 * Assumptions:
	 *  - I'm going to assume that since iso-8859-4 and
	 *    iso-8859-13 are both Baltic, iso-8859-4 also maps
	 *    to windows-cp1257.
	 */
	
	if (!strcasecmp (isocharset, "iso-8859-1") ||
	    !strcasecmp (isocharset, "us-ascii"))
		return "windows-cp1252";
	else if (!strcasecmp (isocharset, "iso-8859-2"))
		return "windows-cp1250";
	else if (!strcasecmp (isocharset, "iso-8859-4"))
		return "windows-cp1257";
	else if (!strcasecmp (isocharset, "iso-8859-5"))
		return "windows-cp1251";
	else if (!strcasecmp (isocharset, "iso-8859-6"))
		return "windows-cp1256";
	else if (!strcasecmp (isocharset, "iso-8859-7"))
		return "windows-cp1253";
	else if (!strcasecmp (isocharset, "iso-8859-8"))
		return "windows-cp1255";
	else if (!strcasecmp (isocharset, "iso-8859-9"))
		return "windows-cp1254";
	else if (!strcasecmp (isocharset, "iso-8859-13"))
		return "windows-cp1257";
	
	return isocharset;
}
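
In case it helps, here is a rough sketch of how the two pieces might be
combined. It is rewritten against a plain buffer so that it compiles without
GLib, and it uses a lookup table instead of the if/else chain. The names
has_c1_bytes, charset_map and effective_charset are made up for this example,
and falling back to the declared charset for unmapped names is my assumption:

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>	/* strcasecmp (POSIX) */

/* Returns 1 if any byte lies in 0x80..0x9F. iso-8859-x reserves that range
 * for C1 control codes, while windows-125x puts printable characters there. */
static int
has_c1_bytes (const unsigned char *data, size_t len)
{
	size_t i;

	assert (data != NULL || len == 0);

	for (i = 0; i < len; i++) {
		if (data[i] >= 0x80 && data[i] <= 0x9f)
			return 1;
	}

	return 0;
}

/* Same pairs as the if/else chain above, in table form. */
static const struct {
	const char *iso;
	const char *windows;
} charset_map[] = {
	{ "us-ascii",    "windows-cp1252" },
	{ "iso-8859-1",  "windows-cp1252" },
	{ "iso-8859-2",  "windows-cp1250" },
	{ "iso-8859-4",  "windows-cp1257" },	/* assumed, see above */
	{ "iso-8859-5",  "windows-cp1251" },
	{ "iso-8859-6",  "windows-cp1256" },
	{ "iso-8859-7",  "windows-cp1253" },
	{ "iso-8859-8",  "windows-cp1255" },
	{ "iso-8859-9",  "windows-cp1254" },
	{ "iso-8859-13", "windows-cp1257" }
};

/* Returns the charset to decode with: the declared one unless the data
 * contains C1 bytes and a Windows equivalent is known. */
static const char *
effective_charset (const unsigned char *data, size_t len, const char *declared)
{
	size_t i;

	assert (declared != NULL);

	if (!has_c1_bytes (data, len))
		return declared;

	for (i = 0; i < sizeof (charset_map) / sizeof (charset_map[0]); i++) {
		if (!strcasecmp (declared, charset_map[i].iso))
			return charset_map[i].windows;
	}

	return declared;
}
```

The returned name would then be handed to whatever does the actual
conversion (iconv or similar) in place of the charset from the headers.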

Btw, if you guys ever figure out any of the windows-cp???? entries, please
let me know :-)

As for whether Balsa should detect this stuff (windows charsets and/or
simply 'unknown's), that's a judgement call for the developers to make.
I won't comment on that.

I will say that autodetecting the charset of unstructured
8bit/multibyte text is basically "impossible".

That said, emacs and mozilla seem to have some limited support for it,
although I'm not sure it's worth the effort.

Evolution tries a few different charsets - I think if there is no
charset, it checks to see if it is UTF-8, failing that it'll try locale
and maybe the body charset if it is specified (I'm talking about raw
8bit headers here). For the body, Evolution gives the user the ability
to override the display charset if he/she wishes.
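
To illustrate the order I described for raw 8bit headers, here is a minimal
sketch. It is not Evolution's actual code: is_valid_utf8 and
guess_header_charset are names invented for this example, and it only picks
a candidate charset rather than attempting the conversion and moving on when
that fails:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if buf[0..len) is well-formed UTF-8, else 0. Rejects truncated
 * sequences, overlong forms, surrogates and values above U+10FFFF. */
static int
is_valid_utf8 (const unsigned char *buf, size_t len)
{
	size_t i = 0;

	assert (buf != NULL || len == 0);

	while (i < len) {
		unsigned char c = buf[i];
		unsigned long cp, min;
		size_t need;

		if (c < 0x80) {
			i++;
			continue;
		} else if ((c & 0xe0) == 0xc0) {
			need = 1; cp = c & 0x1f; min = 0x80;
		} else if ((c & 0xf0) == 0xe0) {
			need = 2; cp = c & 0x0f; min = 0x800;
		} else if ((c & 0xf8) == 0xf0) {
			need = 3; cp = c & 0x07; min = 0x10000;
		} else {
			return 0;	/* stray continuation or invalid lead byte */
		}

		if (len - i <= need)
			return 0;	/* truncated sequence */

		for (i++; need > 0; need--, i++) {
			if ((buf[i] & 0xc0) != 0x80)
				return 0;
			cp = (cp << 6) | (buf[i] & 0x3f);
		}

		if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
			return 0;
	}

	return 1;
}

/* Picks a charset for a raw header that carries no charset information:
 * UTF-8 if it validates, else the locale charset, else the body charset.
 * Either fallback may be NULL; the result is NULL if nothing applies. */
static const char *
guess_header_charset (const char *header, const char *locale_charset,
		      const char *body_charset)
{
	assert (header != NULL);

	if (is_valid_utf8 ((const unsigned char *) header, strlen (header)))
		return "UTF-8";

	if (locale_charset != NULL)
		return locale_charset;

	return body_charset;
}
```

Trying UTF-8 first is cheap and fairly safe, because 8bit text in a legacy
single-byte charset rarely happens to form valid UTF-8 sequences.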

Jeff

-- 
Jeffrey Stedfast <fejj@stampede.org>



