Re: [xml] xmlCheckUTF8-problem (bugfix) [encrypted]



Hi Bill,
I have written my own xmlCheckUTF8 function. Maybe this will solve the problem.

--------------------
/*
 * Check that a NUL-terminated string is structurally valid UTF-8.
 * Returns 1 if every sequence is well formed, 0 otherwise.
 * Only the bit patterns are checked; overlong forms are not rejected.
 */
int
xmlCheckUTF8(const unsigned char *utf)
{
    int ix;
    unsigned char c;

    for (ix = 0; (c = utf[ix]) != 0;) {
        if ((c & 0x80) == 0x00) {        /* 1-byte code, starts with 0 */
            ix++;
        } else if ((c & 0xe0) == 0xc0) { /* 2-byte code, starts with 110 */
            /* a NUL here fails the test, so we never read past the end */
            if ((utf[ix + 1] & 0xc0) != 0x80)
                return 0;
            ix += 2;
        } else if ((c & 0xf0) == 0xe0) { /* 3-byte code, starts with 1110 */
            if (((utf[ix + 1] & 0xc0) != 0x80) ||
                ((utf[ix + 2] & 0xc0) != 0x80))
                return 0;
            ix += 3;
        } else if ((c & 0xf8) == 0xf0) { /* 4-byte code, starts with 11110 */
            if (((utf[ix + 1] & 0xc0) != 0x80) ||
                ((utf[ix + 2] & 0xc0) != 0x80) ||
                ((utf[ix + 3] & 0xc0) != 0x80))
                return 0;
            ix += 4;
        } else {                         /* unknown encoding */
            return 0;
        }
    }
    return 1;
}

--------------------
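
In case it is useful, here is a sketch of a stricter variant. Besides the bit patterns, it also rejects overlong forms, UTF-16 surrogates and code points above U+10FFFF, all of which RFC 3629 disallows. The name check_utf8_strict is made up for illustration and is not part of libxml2; like the function above, it assumes a NUL-terminated buffer.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a libxml2 function): returns 1 if the
 * NUL-terminated buffer is well-formed UTF-8 per RFC 3629, else 0.
 */
int
check_utf8_strict(const unsigned char *utf)
{
    size_t ix = 0;
    unsigned char c;

    if (utf == NULL)
        return 0;

    while ((c = utf[ix]) != 0) {
        unsigned long cp;   /* code point decoded so far */
        unsigned long min;  /* smallest code point allowed for this length */
        int extra;          /* number of continuation bytes expected */
        int k;

        if ((c & 0x80) == 0x00) {          /* 0xxxxxxx: plain ASCII */
            ix++;
            continue;
        } else if ((c & 0xe0) == 0xc0) {   /* 110xxxxx: 2-byte sequence */
            extra = 1; cp = c & 0x1f; min = 0x80;
        } else if ((c & 0xf0) == 0xe0) {   /* 1110xxxx: 3-byte sequence */
            extra = 2; cp = c & 0x0f; min = 0x800;
        } else if ((c & 0xf8) == 0xf0) {   /* 11110xxx: 4-byte sequence */
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {                           /* stray continuation or bad lead */
            return 0;
        }

        for (k = 1; k <= extra; k++) {
            unsigned char t = utf[ix + k];
            /* a NUL terminator fails this test, so the loop stops there */
            if ((t & 0xc0) != 0x80)
                return 0;
            cp = (cp << 6) | (unsigned long)(t & 0x3f);
        }

        if (cp < min)                      /* overlong encoding */
            return 0;
        if (cp >= 0xd800 && cp <= 0xdfff)  /* UTF-16 surrogate range */
            return 0;
        if (cp > 0x10ffff)                 /* beyond the Unicode range */
            return 0;

        ix += (size_t)extra + 1;
    }
    return 1;
}
```

The difference from the function above shows up with input such as the two bytes 0xc0 0x80: the bit patterns look fine, but the sequence is an overlong encoding of U+0000, so this variant returns 0.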


On 28.08.2004, at 03:24, William M. Brack wrote:

Julius Mittenzwei [c] said:
Hi again,

I tried to trace the problem a bit.

A valid 2-byte utf8 char must be something like:

110xxxxx 10xxxxxx (http://de.wikipedia.org/wiki/UTF8)

I would suggest changing this line in xmlstring.c:

      if ((c & 0xc0) != 0x80 || (utf[ix + 1] & 0xc0) != 0x80)

to:

      if ((c & 0xe0) != 0xc0  || ( utf[ix + 1] & 0xc0 ) != 0x80 )

This ANDs "c" with 11100000 = 0xe0 to keep only the first 3 bits.
If the result is exactly 11000000 = 0xc0, you can be sure that the byte
starts with "110".
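
As a small sketch of that mask test (the helper names are made up for illustration and do not exist in xmlstring.c):

```c
/* Hypothetical helper: 1 if c has the form 110xxxxx (2-byte lead byte). */
int
is_two_byte_lead(unsigned char c)
{
    /* 0xe0 = 11100000 keeps the top three bits; 0xc0 = 11000000 */
    return (c & 0xe0) == 0xc0;
}

/* Hypothetical helper: 1 if c has the form 10xxxxxx (continuation byte). */
int
is_continuation(unsigned char c)
{
    /* 0xc0 = 11000000 keeps the top two bits; 0x80 = 10000000 */
    return (c & 0xc0) == 0x80;
}
```

For example, 0xc3 0xa4 passes both tests (lead byte, then continuation byte), while a first byte of 0x80 or 0xe2 does not pass the lead-byte test.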

Regards
/Julius

Hmmm... I'm afraid I can't agree with that. Remember that a UTF-8 character
is a byte sequence which can be 1, 2, 3 or even 4 bytes long (RFC 3629). So,
for a 3-byte sequence, a first byte that masks to "0xe0" is equally valid :-(.
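
To illustrate the point, the first byte of a sequence can be classified by its high bits roughly like this. It is only a sketch: utf8_seq_len is a hypothetical name, not a libxml2 function, and it looks at the first byte alone, so it says nothing about the continuation bytes or about overlong forms.

```c
/*
 * Hypothetical helper: sequence length implied by the first byte,
 * or 0 if the byte cannot start a UTF-8 sequence.
 */
int
utf8_seq_len(unsigned char c)
{
    if ((c & 0x80) == 0x00)   /* 0xxxxxxx */
        return 1;
    if ((c & 0xe0) == 0xc0)   /* 110xxxxx */
        return 2;
    if ((c & 0xf0) == 0xe0)   /* 1110xxxx */
        return 3;
    if ((c & 0xf8) == 0xf0)   /* 11110xxx */
        return 4;
    return 0;                 /* 10xxxxxx or 11111xxx */
}
```

A first byte of 0xe2 gives 3 here, so a test that accepts only the "110" pattern has to sit inside the 2-byte case rather than being applied to every multi-byte sequence.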

Despite this minor disagreement, I totally agree with you that there is a
problem, and it needs to be fixed. I did a little "history checking" and
found that this particular line of code was recently changed, and the change
was because of http://bugzilla.gnome.org/show_bug.cgi?id=148115.  Very
unfortunately, as you have pointed out, our fix for that bug was not totally
satisfactory :-\.

I have re-examined that area of the code, and have (hopefully) enhanced it to
a state where it should take care of all of the different cases correctly
(basically I changed the first half of the above 'if' to check for equality
with 0xc0). I also added several comments along the way to show what I
(think) I'm doing :-). Could you check out the revised routine from CVS and
see if it solves your case satisfactorily?  Thanks for the report, and for
your help!

Bill

