Re: [xml] xmlCheckUTF8-problem (bugfix) [signed]



Julius Mittenzwei [c] said:
Hi again,

I tried to trace the problem a bit.

A valid 2-byte utf8 char must be something like:

110xxxxx 10xxxxxx (http://de.wikipedia.org/wiki/UTF8)

I would suggest to change this line:

      if ((c & 0xc0) != 0x80 || (utf[ix + 1] & 0xc0) != 0x80)
in
     xmlstring.c
to
      if ((c & 0xe0) != 0xc0  || ( utf[ix + 1] & 0xc0 ) != 0x80 )

This ANDs "c" with 11100000 (0xe0) to keep the top 3 bits.
If the result is exactly 11000000 (0xc0), you can be sure that the byte
starts with "110".
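
To make the difference concrete, here is a minimal sketch that evaluates the
two conditions quoted above as standalone predicates. The helper names
old_check_rejects and proposed_check_rejects are hypothetical and exist only
for this illustration; the surrounding code in xmlstring.c is not shown here,
so this says nothing about how the real routine reaches that line.

```c
/*
 * Hypothetical helpers for illustration only (not part of libxml2).
 * Each returns nonzero when the condition would REJECT the byte pair
 * (c, next) as a 2-byte UTF-8 sequence.
 */

/* The condition as currently written: requires c to look like 10xxxxxx. */
static int old_check_rejects(unsigned char c, unsigned char next)
{
    return (c & 0xc0) != 0x80 || (next & 0xc0) != 0x80;
}

/* The suggested condition: requires c to look like 110xxxxx. */
static int proposed_check_rejects(unsigned char c, unsigned char next)
{
    return (c & 0xe0) != 0xc0 || (next & 0xc0) != 0x80;
}
```

For the valid pair 0xc3 0xa4 (U+00E4), old_check_rejects returns nonzero
because 0xc3 & 0xc0 is 0xc0 rather than 0x80, while proposed_check_rejects
returns 0.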

Regards
/Julius

Hmmm...  I'm afraid I can't agree with that.  Remember that a UTF-8
character is a "string" which can be 1, 2, 3 or even 4 bytes long (RFC 3629).
So, for a 3-byte string, a first byte of the form 1110xxxx ("0xe0") is
equally valid :-(.

Despite this minor disagreement, I totally agree with you that there is a
problem, and it needs to be fixed.  I did a little "history checking" and
found that this particular line of code was recently changed, and the change
was because of http://bugzilla.gnome.org/show_bug.cgi?id=148115.  Very
unfortunately, as you have pointed out, our fix for that bug was not totally
satisfactory :-\.

I have re-examined that area of the code, and have (hopefully) enhanced it to
a state where it should take care of all of the different cases correctly
(basically I changed the first half of the above 'if' to check for equality
with 0xc0).  I also added several comments along the way to show what I
(think) I'm doing :-).  Could you check out the revised routine from CVS and
see if it solves your case satisfactorily?  Thanks for the report, and for
your help!
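
For readers without a CVS checkout, the following is a rough sketch of how a
check covering the 1-, 2-, 3- and 4-byte cases could look. It is NOT the
revised libxml2 routine; the function name sketch_check_utf8 is hypothetical,
and it is written under the assumption that the input is a 0-terminated byte
buffer. It only tests the lead-byte and continuation-byte bit patterns, so it
does not reject overlong encodings, surrogates, or code points above U+10FFFF.

```c
#include <stddef.h>

/*
 * Hypothetical sketch, not the libxml2 implementation.
 * Returns 1 if every sequence in the 0-terminated buffer has a valid
 * lead byte followed by the right number of 10xxxxxx bytes, else 0.
 */
static int sketch_check_utf8(const unsigned char *utf)
{
    size_t ix = 0;
    unsigned char c;

    if (utf == NULL)
        return 0;

    while ((c = utf[ix]) != 0) {
        size_t extra;   /* number of continuation bytes expected */
        size_t k;

        if ((c & 0x80) == 0x00)
            extra = 0;              /* 0xxxxxxx: 1-byte sequence */
        else if ((c & 0xe0) == 0xc0)
            extra = 1;              /* 110xxxxx: 2-byte sequence */
        else if ((c & 0xf0) == 0xe0)
            extra = 2;              /* 1110xxxx: 3-byte sequence */
        else if ((c & 0xf8) == 0xf0)
            extra = 3;              /* 11110xxx: 4-byte sequence */
        else
            return 0;               /* stray 10xxxxxx or invalid lead byte */

        /*
         * Each following byte must be 10xxxxxx.  A 0 terminator fails
         * this test, so the loop never reads past the end of the buffer.
         */
        for (k = 1; k <= extra; k++) {
            if ((utf[ix + k] & 0xc0) != 0x80)
                return 0;
        }
        ix += extra + 1;
    }
    return 1;
}
```

With this sketch, the bytes 0xc3 0xa4 are accepted as a 2-byte sequence and
0xe2 0x82 0xac as a 3-byte sequence, while a lone 0xc3 or a lone 0xa4 is
rejected.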

Bill



