Re: [evolution-patches] Fix for Evolution 1.2 shortcut migration



> 
> > <item name="R&#195;&#169;sum&#195;&#169;"...
> 
> Right. The problem is that libxml1 wrote out the UTF8 wrong (storing
> each *byte* of the UTF8-encoded string as a separate entity instead of
> storing each *character* as its own entity). So when you read it into
> libxml2, each byte of UTF8 encoding becomes a separate character and you
> end up with "RÃ©sumÃ©".
> 
> Converting it to locale encoding isn't the right fix though; you
> essentially want to convert to iso-8859-1 regardless of what the locale
> encoding is (because that reverses the translation above: the "Ã"s
> become 0xC3, and the "©"s become 0xA9, and then when you hand the data
> back to libxml, it sees "0x52 0xC3 0xA9 0x73 0x75 0x6D 0xC3 0xA9", which
> is the UTF-8 encoding of "Résumé").
> 
> But it would be less confusing to just do the transformation by hand,
> since you don't really mean "convert from utf-8 to iso-8859-1", you just
> mean "replace each multibyte utf-8 character with the corresponding
> single-byte value".

Hmm, I'm not so sure of that: it will work for strings that libxml1
wrote out badly from iso-8859-1 text (i.e. French), but I'm not sure it
will work for non-ISO-8859-1 strings (like Chinese ...)

Naah, what Dan says is that the content is 8-bit UTF-8 converted to XML entities byte-by-byte rather than as Unicode characters.

So what you do is take the input stream, read it as UTF-8, but then treat each Unicode character you get as a single UTF-8 byte, rather than as a gunichar.
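
A minimal sketch of that idea in Python (not the actual migration code, which would be C; the helper name `undo_bytewise_entities` is made up for illustration). It assumes every character the parser hands back is below U+0100, i.e. is really a stray byte:

```python
import xml.etree.ElementTree as ET


def undo_bytewise_entities(text: str) -> str:
    """Treat each character as one raw byte, then decode those bytes as UTF-8."""
    # Hypothetical helper. bytes() raises ValueError if any character is
    # >= U+0100, and decode() raises UnicodeDecodeError if the resulting
    # bytes are not valid UTF-8.
    raw = bytes(ord(ch) for ch in text)
    return raw.decode("utf-8")


# The attribute as libxml1 wrote it: one entity per UTF-8 byte.
broken_xml = '<item name="R&#195;&#169;sum&#195;&#169;"/>'

# A conforming parser turns each entity into its own character (8 in total).
broken = ET.fromstring(broken_xml).get("name")

# Mapping each character back to a byte and decoding gives the 6-character name.
fixed = undo_bytewise_entities(broken)
```

Pure ASCII names pass through unchanged. A name that was never mangled but contains a real accented character would most likely fail to decode, so a caller would probably want to catch the error and keep the original string in that case.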

Chinese, for example, will have multiple characters encoded similarly; e.g. a 4-byte sequence would normally be encoded like

ABCD

but any of A, B, C or D that is > 7 bits gets encoded as if it were an iso-8859-1 character, in 2 UTF-8 bytes,

e.g. AaBCcD, which is how libxml2 will read it back.
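
To make that concrete, here is a small Python sketch using one Chinese character picked purely as an example (U+4E2D, which is three UTF-8 bytes rather than four). The helper name `bytewise_mangle` is hypothetical; it only imitates the net effect of the libxml1 write followed by the libxml2 read. In real UTF-8 every byte of a multibyte sequence has the high bit set, so in this example each of the three bytes doubles:

```python
def bytewise_mangle(text: str) -> str:
    """Imitate the bug: each UTF-8 byte of `text` comes back as its own character."""
    # Hypothetical helper, not libxml code. Decoding as latin-1 maps
    # byte 0xNN to the character U+00NN, which is what the entities did.
    return text.encode("utf-8").decode("latin-1")


original = "\u4e2d"                           # one Chinese character
as_bytes = original.encode("utf-8")           # its three UTF-8 bytes
mangled = bytewise_mangle(original)           # three separate characters
seen_by_libxml2 = mangled.encode("utf-8")     # each of those takes two bytes

# Undoing it is the same character-to-byte mapping, whatever the language.
restored = bytes(ord(ch) for ch in mangled).decode("utf-8")
```

So the fix does not depend on the text having been iso-8859-1 to begin with; iso-8859-1 only shows up because it is the mapping from one byte to one character.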

Does that make any more sense?

Michael


