[xml] Possible bug with iconv-less UTF8 to ISO-8859-15 conversion



I believe there is a bug in the routine UTF8ToISO8859x in encoding.c when converting multi-byte UTF-8 
characters.

If I run the version of xmllint that comes with Solaris 9 (presumably built with iconv) with the following 
command:

xmllint --encode ISO-8859-15 utf8-15.xml

I see the output contains all the correct characters (viewed in the Reflection terminal emulator with the 
host character set configured as 8859-15). The file utf8-15.xml comes from the attachment 8859x-tests.tar.gz in the 
archive item "[xml] [PATCH] Character encodng cleanup".

If I run my own xmllint (2.6.13, built without iconv), I don't see the actual characters; I see numeric 
character references instead, e.g. &#180;.

I think the problem is about 32 lines down in UTF8ToISO8859x in encoding.c. The line that reads

           if ((c & 0xC0) != 0xC0) {

should read

           if ((c & 0xC0) != 0x80) {

since the second byte of a UTF-8 sequence is a continuation byte and must be of the form 10bbbbbb. If I make 
this change then my xmllint outputs the expected characters rather than the character references - that is, 
apart from the euro symbol, which I will look into tomorrow.
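
To illustrate the two-byte case, here is a minimal sketch of the check I have in mind. It is not the code from encoding.c; the helper names is_continuation and decode_two_byte are made up for this example, and it deliberately ignores overlong encodings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: a continuation byte has the form 10bbbbbb,
 * so masking with 0xC0 must give 0x80 (not 0xC0). */
static int is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Hypothetical helper: decode a two-byte sequence 110xxxxx 10xxxxxx.
 * Returns the code point, or -1 if either byte has the wrong form.
 * Overlong forms (lead bytes 0xC0 and 0xC1) are not rejected here. */
static int decode_two_byte(const unsigned char *in)
{
    assert(in != NULL);
    if ((in[0] & 0xE0) != 0xC0)
        return -1;               /* not a two-byte lead byte */
    if (!is_continuation(in[1]))
        return -1;               /* second byte is not 10bbbbbb */
    return ((in[0] & 0x1F) << 6) | (in[1] & 0x3F);
}
```

For example, U+00B4 (decimal 180) is encoded as the bytes 0xC2 0xB4. The second byte gives 0xB4 & 0xC0 == 0x80, so it passes the corrected test, whereas comparing against 0xC0 rejects it.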

There are also two lines of code further down, for three-byte sequences, which I think need changing in the 
same way. They are:

            if ((c1 & 0xC0) != 0xC0) {

and

            if ((c2 & 0xC0) != 0xC0) {
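
As a rough sketch of what the corrected three-byte checks would look like (again, not the actual encoding.c code; decode_three_byte is a hypothetical name, and overlong forms and surrogates are not rejected):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode a three-byte sequence
 * 1110xxxx 10xxxxxx 10xxxxxx. Returns the code point, or -1 if
 * any byte has the wrong form. */
static int decode_three_byte(const unsigned char *in)
{
    unsigned char c, c1, c2;

    assert(in != NULL);
    c = in[0];
    c1 = in[1];
    c2 = in[2];

    if ((c & 0xF0) != 0xE0)
        return -1;               /* not a three-byte lead byte */
    if ((c1 & 0xC0) != 0x80)
        return -1;               /* first continuation byte must be 10bbbbbb */
    if ((c2 & 0xC0) != 0x80)
        return -1;               /* second continuation byte must be 10bbbbbb */

    return ((c & 0x0F) << 12) | ((c1 & 0x3F) << 6) | (c2 & 0x3F);
}
```

The euro symbol, U+20AC, is encoded as 0xE2 0x82 0xAC, so both continuation bytes satisfy the 0x80 comparison and would fail a comparison against 0xC0.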

Hopefully someone else can verify that I am on the right lines.

Mark






