[xml] Possible bug with iconv-less UTF8 to ISO-8859-15 conversion
- From: Mark Itzcovitz <mark itzcovitz ntlworld com>
- To: xml gnome org
- Subject: [xml] Possible bug with iconv-less UTF8 to ISO-8859-15 conversion
- Date: Wed, 8 Sep 2004 17:32:34 +0000
I believe there is a bug in the routine UTF8ToISO8859x in encoding.c when converting multi-byte UTF-8
characters.
If I run the version of xmllint that comes with Solaris 9 (presumably built with iconv) with the following
command:
xmllint --encode ISO-8859-15 utf8-15.xml
I see the output contains all the correct characters (viewed in a terminal emulator (Reflection) with the
host character set set to 8859-15). The file utf8-15.xml comes from the attachment 8859x-tests.tar.gz in the
archive item "[xml] [PATCH] Character encodng cleanup".
If I run my xmllint (2.6.13, built without iconv) I don't see the actual characters; I see their numeric
values instead, e.g. the value for ´ (U+00B4).
I think the problem is about 32 lines down in UTF8ToISO8859x in encoding.c. The line that reads
if ((c & 0xC0) != 0xC0) {
should read
if ((c & 0xC0) != 0x80) {
since the second byte of a UTF-8 sequence must be of the form 10bbbbbb. If I make this change then my xmllint
outputs the expected characters rather than the values - that is, apart from the euro symbol, which I will
look into tomorrow.
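To illustrate the two-byte case, here is a minimal standalone sketch of the check. It is not the code from encoding.c; the function names are made up for this example. It assumes a lead byte of the form 110bbbbb followed by one continuation byte of the form 10bbbbbb.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if c has the form 10bbbbbb. */
static int is_continuation_byte(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* The test as it currently reads: it only accepts bytes of the form
 * 11bbbbbb, which can never be a continuation byte. */
static int old_test_accepts(unsigned char c)
{
    return (c & 0xC0) == 0xC0;
}

/* Hypothetical decoder for a two-byte sequence 110xxxxx 10yyyyyy.
 * Returns the code point, or -1 if the bytes do not have that shape.
 * It does not reject overlong encodings. */
static int decode_two_byte(const unsigned char *s)
{
    assert(s != NULL);
    if ((s[0] & 0xE0) != 0xC0)
        return -1;                      /* not a two-byte lead byte */
    if (!is_continuation_byte(s[1]))
        return -1;                      /* second byte must be 10bbbbbb */
    return ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
}
```

For U+00B4, which is encoded as the bytes C2 B4, decode_two_byte gives 0xB4, whereas old_test_accepts(0xB4) is false, so the original test treats a valid second byte as an error.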
There are also two lines of code further down, for three-byte sequences, which I think need changing in the
same way. They are:
if ((c1 & 0xC0) != 0xC0) {
and
if ((c2 & 0xC0) != 0xC0) {
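As a sketch of what I would expect for the three-byte case (again standalone code with a made-up function name, not the routine from encoding.c), both trailing bytes get the same 10bbbbbb test. The variable names c1 and c2 follow the lines quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decoder for a three-byte sequence
 * 1110xxxx 10yyyyyy 10zzzzzz.
 * Returns the code point, or -1 if the bytes do not have that shape.
 * It does not reject overlong encodings or surrogates. */
static long decode_three_byte(const unsigned char *s)
{
    unsigned char c, c1, c2;

    assert(s != NULL);
    c = s[0];
    c1 = s[1];
    c2 = s[2];

    if ((c & 0xF0) != 0xE0)
        return -1;                      /* not a three-byte lead byte */
    if ((c1 & 0xC0) != 0x80)
        return -1;                      /* second byte must be 10bbbbbb */
    if ((c2 & 0xC0) != 0x80)
        return -1;                      /* third byte must be 10bbbbbb */

    return ((long)(c & 0x0F) << 12) | ((long)(c1 & 0x3F) << 6) | (long)(c2 & 0x3F);
}
```

As a check, the euro sign U+20AC is encoded as the bytes E2 82 AC, and this sketch decodes them to 0x20AC. I have not confirmed whether this is related to the euro problem mentioned above.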
Hopefully someone else can verify that I am on the right lines.
Mark