Re: [xml] xmlCheckUTF8-problem (bugfix) [signed]

Julius Mittenzwei [c] said:
Hi again,

I tried to trace the problem a bit.

A valid 2-byte UTF-8 character must look something like:

110xxxxx 10xxxxxx

I would suggest changing this line:

      if ((c & 0xc0) != 0x80 || (utf[ix + 1] & 0xc0) != 0x80)

to:

      if ((c & 0xe0) != 0xc0 || (utf[ix + 1] & 0xc0) != 0x80)

It ANDs "c" with 11100000 = 0xe0 to keep the first 3 bits.
If the result is exactly 11000000 = 0xc0, you can be sure that the byte
starts with "110".
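
To make the mask arithmetic above concrete, here is a minimal sketch of the
two bit tests involved. The helper names are hypothetical and are not part of
libxml2; they only illustrate the masks discussed in this thread.

```c
/* Hypothetical helpers for illustration only, not libxml2 API. */

/* Nonzero if c has the bit pattern 110xxxxx, i.e. the lead byte of a
 * 2-byte sequence: masking with 0xe0 keeps the top 3 bits. */
int is_two_byte_lead(unsigned char c)
{
    return (c & 0xe0) == 0xc0;
}

/* Nonzero if c has the bit pattern 10xxxxxx, i.e. a continuation byte:
 * masking with 0xc0 keeps the top 2 bits. */
int is_continuation(unsigned char c)
{
    return (c & 0xc0) == 0x80;
}
```

For example, 0xc3 (11000011) passes the lead-byte test, while 0xe2 (11100010),
which starts a 3-byte sequence, does not.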


Hmmm...  I'm afraid I can't agree with that.  Remember that UTF-8 data is a
"string" which can be 1, 2, 3 or even 4 bytes long (RFC 3629).  So, for a
3-byte string the lead-byte value "0xe0" (1110xxxx) is equally valid :-(.
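
As a rough illustration of the 1- to 4-byte forms, the following sketch
classifies a lead byte by its high bits. The function name is hypothetical
(it is not a libxml2 routine), and it looks only at the lead byte, so it says
nothing about whether the continuation bytes that follow are valid.

```c
/* Hypothetical helper for illustration only, not libxml2 API.
 * Returns the sequence length implied by a lead byte, or 0 if the
 * byte cannot start a sequence. */
int utf8_seq_len(unsigned char c)
{
    if ((c & 0x80) == 0x00)
        return 1;               /* 0xxxxxxx */
    if ((c & 0xe0) == 0xc0)
        return 2;               /* 110xxxxx */
    if ((c & 0xf0) == 0xe0)
        return 3;               /* 1110xxxx */
    if ((c & 0xf8) == 0xf0)
        return 4;               /* 11110xxx */
    return 0;                   /* 10xxxxxx or 11111xxx: not a lead byte */
}
```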

Despite this minor disagreement, I totally agree with you that there is a
problem, and it needs to be fixed.  I did a little "history checking" and
found that this particular line of code was recently changed, and the change
was made because of an earlier bug report.  Very unfortunately, as you have
pointed out, our fix for that bug was not totally satisfactory :-\.

I have re-examined that area of the code, and have (hopefully) enhanced it to
a state where it should take care of all of the different cases correctly
(basically I changed the first half of the above 'if' to check for equality
with 0xc0).  I also added several comments along the way to show what I
(think) I'm doing :-).  Could you check out the revised routine from CVS and
see if it solves your case satisfactorily?  Thanks for the report, and for
your help!
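
For readers following the thread, here is a minimal sketch of a structural
check covering all four sequence lengths. It is an assumption about the
general shape of such a routine, not the code committed to CVS. The name
check_utf8_sketch is hypothetical, and the sketch does not reject overlong
encodings or surrogate code points.

```c
#include <stddef.h>

/* Illustrative sketch only, not the libxml2 implementation.
 * Returns 1 if the 0-terminated string is structurally valid UTF-8,
 * 0 otherwise. */
int check_utf8_sketch(const unsigned char *utf)
{
    size_t ix = 0;
    unsigned char c;

    if (utf == NULL)
        return 0;

    while ((c = utf[ix]) != 0) {
        if ((c & 0x80) == 0x00) {               /* 0xxxxxxx */
            ix += 1;
        } else if ((c & 0xe0) == 0xc0) {        /* 110xxxxx 10xxxxxx */
            if ((utf[ix + 1] & 0xc0) != 0x80)
                return 0;
            ix += 2;
        } else if ((c & 0xf0) == 0xe0) {        /* 1110xxxx + 2 more */
            /* || short-circuits, so a terminating 0 byte stops the
             * check before any read past the end of the string. */
            if ((utf[ix + 1] & 0xc0) != 0x80 ||
                (utf[ix + 2] & 0xc0) != 0x80)
                return 0;
            ix += 3;
        } else if ((c & 0xf8) == 0xf0) {        /* 11110xxx + 3 more */
            if ((utf[ix + 1] & 0xc0) != 0x80 ||
                (utf[ix + 2] & 0xc0) != 0x80 ||
                (utf[ix + 3] & 0xc0) != 0x80)
                return 0;
            ix += 4;
        } else {
            return 0;                           /* stray 10xxxxxx or 11111xxx */
        }
    }
    return 1;
}
```

Each branch tests the lead byte against exactly one bit pattern, so a 3-byte
lead such as 0xe2 is never mistaken for a 2-byte lead.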


