Re: [patch] treatment of wrongly encoded 8-bit messages



if you guys want to play with black magic, feel free to check out
http://primates.ximian.com/~fejj/charset-foo.[c,h]

a little something I've been working on for the past 2 nights to try to
auto-detect what charset a given stream of text is in... seems to work
okay. (don't expect iso-8859-8 or iso-8859-4 texts to be recognized yet,
tho... and iso-8859-5 is also a bit sketchy. koi8-r seems to work well,
tho, and that is probably more important than iso-8859-5 for russian
anyway.)

just so you know, these are the charsets it *attempts* to check for:

        { "iso-8859-1", 0x20 },
        { "iso-8859-2", 0x40 },
        { "iso-8859-4", 0x80 },
        { "iso-8859-5", 0x100 },
        { "iso-8859-7", 0x200 },
        { "iso-8859-8", 0x400 },
        { "iso-8859-9", 0x800 },
        { "iso-8859-13", 0x1000 },
        { "iso-8859-15", 0x2000 },
        { "windows-1251", 0x4000 },
        { "koi8-r", 0x8000 },
        { "koi8-u", 0x10000 },
        { "shift-jis", 0x20000 },
        { "gb2312", 0x40000 },
        { "euc-jp", 0x80000 },
        { "euc-kr", 0x100000 },
        { "euc-tw", 0x200000 },
        { "big5", 0x400000 },

hmmm, I should remove euc-tw... don't have any samples for that and it
is very uncommon anyway.
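
for anyone wondering why each charset gets its own bit: a table like that
lets you keep every candidate that is still possible in one mask and knock
candidates out as you scan. Below is only a rough sketch of that idea, not
the code in charset-foo.c. The struct and function names are made up for
illustration, and the one elimination rule shown (bytes 0x80-0x9f are C1
control codes in the iso-8859-* family, so text containing them is
probably not iso-8859-*) is an assumption of mine, not necessarily what
the real detector does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; only the table entries come from the
 * list above. */
struct charset_flag {
    const char *name;
    unsigned long bit;      /* one distinct bit per charset */
};

static const struct charset_flag charset_flags[] = {
    { "iso-8859-1", 0x20 },
    { "iso-8859-2", 0x40 },
    { "iso-8859-4", 0x80 },
    { "iso-8859-5", 0x100 },
    { "iso-8859-7", 0x200 },
    { "iso-8859-8", 0x400 },
    { "iso-8859-9", 0x800 },
    { "iso-8859-13", 0x1000 },
    { "iso-8859-15", 0x2000 },
    { "windows-1251", 0x4000 },
    { "koi8-r", 0x8000 },
    { "koi8-u", 0x10000 },
    { "shift-jis", 0x20000 },
    { "gb2312", 0x40000 },
    { "euc-jp", 0x80000 },
    { "euc-kr", 0x100000 },
    { "euc-tw", 0x200000 },
    { "big5", 0x400000 },
};

#define N_CHARSETS (sizeof(charset_flags) / sizeof(charset_flags[0]))

/* Mask in which every charset in the table is still a candidate. */
unsigned long charset_mask_all(void)
{
    unsigned long mask = 0;
    size_t i;

    for (i = 0; i < N_CHARSETS; i++)
        mask |= charset_flags[i].bit;
    return mask;
}

/* Bit for a charset name, or 0 if the name is not in the table. */
unsigned long charset_bit(const char *name)
{
    size_t i;

    for (i = 0; i < N_CHARSETS; i++)
        if (strcmp(charset_flags[i].name, name) == 0)
            return charset_flags[i].bit;
    return 0;
}

/* Assumed heuristic: if any byte falls in 0x80-0x9f, clear the bits of
 * all iso-8859-* candidates and return the narrowed mask. */
unsigned long charset_mask_scan(unsigned long mask,
                                const unsigned char *text, size_t len)
{
    size_t i, j;

    for (i = 0; i < len; i++) {
        if (text[i] < 0x80 || text[i] > 0x9f)
            continue;
        for (j = 0; j < N_CHARSETS; j++)
            if (strncmp(charset_flags[j].name, "iso-8859-", 9) == 0)
                mask &= ~charset_flags[j].bit;
        break;      /* this rule has nothing more to tell us */
    }
    return mask;
}
```

you would start from charset_mask_all(), run the text through
charset_mask_scan() (and whatever other rules you trust), and then test
the result against charset_bit("koi8-r") and friends to see which
candidates survived.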

btw, if any of you have text documents in any of those charsets (in
particular iso-8859-4, -5, -8 and shift-jis, since I am severely lacking
in those departments currently), feel free to send them to me so I can
improve support for detecting those charsets.

(note: make sure they contain nothing personal)

Jeff

On Thu, 2003-04-03 at 13:48, Albrecht Dreß wrote:
> On 03.04.03 11:08, Pawel Salek wrote:
> > When I try to reply to such a misformatted message, I get loads of
> > 
> > (balsa:9949): Gtk-CRITICAL **: file gtktextbuffer.c: line 543 
> > (gtk_text_buffer_emit_insert): assertion `g_utf8_validate (text, len, 
> > NULL)' failed
> 
> Ooops... As the replacement of badly encoded chars was moved out of 
> libmutt, content2reply now gets the wrong stream with bad chars... An 
> extra libbalsa_utf_sanitize() fixes it (see below). The same problem 
> occurs with printing, btw (also fixed below).
> 
> The patch also removes an extra paranoia check in the gpg stuff, which is 
> a result of the wonders of copy & paste, but completely silly at that 
> point.
> 
> Sorry for the chaos,
> 
> 	Cheers,
> 
> 	Albrecht.

-- 
Jeffrey Stedfast
Evolution Hacker - Ximian, Inc.
fejj@ximian.com  - www.ximian.com



