Re: [xml] xmlCheckUTF8-problem (bugfix) [encrypted]
- From: Julius Mittenzwei <julius muenchen-sued de>
- To: "William M. Brack" <wbrack mmm com hk>
- Cc:
- Subject: Re: [xml] xmlCheckUTF8-problem (bugfix) [encrypted]
- Date: Sat, 28 Aug 2004 14:35:32 +0200
Hi Bill,
I have written my own xmlCheckUTF8 function. Maybe this would solve the
problem.
--------------------
int
xmlCheckUTF8(const unsigned char *utf)
{
    int ix;
    unsigned char c;

    /* Walk the NUL-terminated string one encoded character at a time. */
    for (ix = 0; (c = utf[ix]);) {
        if ((c & 0x80) == 0x00) {        /* 1-byte code, starts with 0 */
            ix++;
        }
        else if ((c & 0xe0) == 0xc0) {   /* 2-byte code, starts with 110 */
            if ((utf[ix + 1] & 0xc0) != 0x80)
                return 0;
            ix += 2;
        }
        else if ((c & 0xf0) == 0xe0) {   /* 3-byte code, starts with 1110 */
            /* || short-circuits, so a NUL at ix+1 stops the check there */
            if (((utf[ix + 1] & 0xc0) != 0x80) ||
                ((utf[ix + 2] & 0xc0) != 0x80))
                return 0;
            ix += 3;
        }
        else if ((c & 0xf8) == 0xf0) {   /* 4-byte code, starts with 11110 */
            if (((utf[ix + 1] & 0xc0) != 0x80) ||
                ((utf[ix + 2] & 0xc0) != 0x80) ||
                ((utf[ix + 3] & 0xc0) != 0x80))
                return 0;
            ix += 4;
        }
        else                             /* unknown encoding */
            return 0;
    }
    return(1);
}
--------------------
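For anyone who wants to try the idea quickly, here is a minimal sketch that compiles on its own. check_utf8_compact is a hypothetical name; it is a condensed restatement of the same lead-byte/continuation-byte logic as the routine above, not libxml2's code.

```c
#include <stddef.h>

/* Hypothetical condensed restatement of the checker above: derive the
 * sequence length from the lead byte, then require that many
 * continuation bytes of the form 10xxxxxx. */
int check_utf8_compact(const unsigned char *utf)
{
    size_t ix = 0;
    unsigned char c;

    while ((c = utf[ix]) != 0) {
        int len, k;

        if ((c & 0x80) == 0x00)      len = 1;   /* 0xxxxxxx */
        else if ((c & 0xe0) == 0xc0) len = 2;   /* 110xxxxx */
        else if ((c & 0xf0) == 0xe0) len = 3;   /* 1110xxxx */
        else if ((c & 0xf8) == 0xf0) len = 4;   /* 11110xxx */
        else return 0;   /* stray continuation byte or invalid lead */

        /* A NUL terminator fails this test, so we never read past it. */
        for (k = 1; k < len; k++)
            if ((utf[ix + k] & 0xc0) != 0x80)
                return 0;
        ix += (size_t)len;
    }
    return 1;
}
```

Inputs such as "abc", "\xc3\xa4" and "\xe2\x82\xac" should be accepted, while a truncated "\xc3" or a lone "\xa4" should be rejected. Note that this kind of check only looks at bit patterns, so it does not reject overlong forms (for example 0xc0 0x80) or surrogate code points, which RFC 3629 also forbids.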
On 28.08.2004, at 03:24, William M. Brack wrote:
Julius Mittenzwei [c] said:
Hi again,
I tried to trace the problem a bit.
A valid 2-byte UTF-8 character must look like:
110xxxxx 10xxxxxx (http://de.wikipedia.org/wiki/UTF8)
I would suggest changing this line in xmlstring.c:
if ((c & 0xc0) != 0x80 || (utf[ix + 1] & 0xc0) != 0x80)
to
if ((c & 0xe0) != 0xc0 || (utf[ix + 1] & 0xc0) != 0x80)
This ANDs "c" with 11100000=0xe0 to keep the first 3 bits.
If the result is exactly 11000000=0xc0, you can be sure that the byte
starts with "110".
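The masking described here can be sketched as two tiny predicates. Both function names are hypothetical and only serve to illustrate the bit tests; they are not part of libxml2.

```c
/* Hypothetical helper: returns 1 when c has the bit pattern 110xxxxx,
 * i.e. it can start a 2-byte UTF-8 sequence. */
int is_two_byte_lead(unsigned char c)
{
    return (c & 0xe0) == 0xc0;   /* keep the top 3 bits, compare with 110 */
}

/* Hypothetical helper: returns 1 when c has the bit pattern 10xxxxxx,
 * i.e. it is a continuation byte. */
int is_continuation(unsigned char c)
{
    return (c & 0xc0) == 0x80;   /* keep the top 2 bits, compare with 10 */
}
```

For example, the character "ä" is encoded as 0xc3 0xa4: is_two_byte_lead(0xc3) and is_continuation(0xa4) both hold.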
Regards
/Julius
Hmmm... I'm afraid I can't agree with that. Remember that UTF-8 data is a
"string" which can be 1, 2, 3 or even 4 bytes long (RFC 3629). So, for a
3-byte string the value "0xe0" is equally valid :-(.
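The point about multiple sequence lengths can be illustrated with a small classifier, given here only as a sketch. utf8_seq_len is a hypothetical name, not a libxml2 function; it maps a lead byte to the total length its bit pattern announces.

```c
/* Hypothetical helper: returns the sequence length (1 to 4) implied by
 * the lead byte c, or 0 if c cannot start a UTF-8 sequence. */
int utf8_seq_len(unsigned char c)
{
    if ((c & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((c & 0xe0) == 0xc0) return 2;   /* 110xxxxx */
    if ((c & 0xf0) == 0xe0) return 3;   /* 1110xxxx */
    if ((c & 0xf8) == 0xf0) return 4;   /* 11110xxx */
    return 0;                           /* 10xxxxxx or 11111xxx */
}
```

With this, a lead byte of 0xe0 is classified as the start of a 3-byte sequence rather than rejected, while a bare continuation byte such as 0x80 yields 0.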
Despite this minor disagreement, I totally agree with you that there is a
problem, and it needs to be fixed. I did a little "history checking" and
found that this particular line of code was recently changed, and the
change was because of http://bugzilla.gnome.org/show_bug.cgi?id=148115.
Very unfortunately, as you have pointed out, our fix for that bug was not
totally satisfactory :-\.
I have re-examined that area of coding, and have (hopefully) enhanced it
to a state where it should take care of all of the different cases
correctly (basically I changed the first half of the above 'if' to check
equal to 0xc0). I also added several comments along the way to show what
I (think) I'm doing :-). Could you check out the revised routine from CVS
and see if it solves your case satisfactorily? Thanks for the report, and
for your help!
Bill
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
xml gnome org
http://mail.gnome.org/mailman/listinfo/xml