[xml] [PATCH] Encoding related issues



Hi,

I was playing around with lxml and noticed that it sometimes fails to decode
UTF-16. So I investigated and experimented and ended up writing patches for
more than just the UTF-16 problem.


First problem: a missing space after '<!DOCTYPE' is a fatal error
which should be reported, but it is not.

Example 1:

  static const char content[] =
    "<!DOCTYPEroot><root/>";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);

  if (doc != NULL)
    fprintf(stdout, "Ex 1: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 1: success; rejected an invalid document.\n");

You will find a solution in attached patch 1.



Section 4.3.3 of the XML 1.0 standard states that XML processors must
be able to read entities in UTF-16. Section 3.10 of the Unicode
standard specifies how UTF-16 is read: the serialisation order (little
endian vs. big endian) is detected from the leading byte order mark
(BOM). However, libxml2 fails to read the mark and assumes that UTF-16
is always little endian.

The Unicode standard is available at
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#page=43
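
For reference, the check Unicode describes is simple. A minimal sketch
of BOM-based byte order detection (illustrative only, this is not the
code in patch 2) might look like this:

  /* Detect UTF-16 byte order from a leading BOM; illustrative sketch
   * only, not the code in patch 2. */
  typedef enum { UTF16_ORDER_UNKNOWN, UTF16_ORDER_BE, UTF16_ORDER_LE } utf16Order;

  static utf16Order
  detectUTF16Order(const unsigned char *buf, int len) {
      if (len >= 2) {
          if ((buf[0] == 0xFE) && (buf[1] == 0xFF))
              return UTF16_ORDER_BE;  /* FE FF: big endian */
          if ((buf[0] == 0xFF) && (buf[1] == 0xFE))
              return UTF16_ORDER_LE;  /* FF FE: little endian */
      }
      /* No BOM: the UTF-16 encoding scheme then defaults to big endian
       * (and XML wants the BOM anyway), yet libxml2 currently assumes
       * little endian. */
      return UTF16_ORDER_UNKNOWN;
  }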

Example 2:

  // UTF-16 (big endian) encoded '<root/>'
  static const char content[] =
    "\xfe\xff\000<\000r\000o\000o\000t\000/\000>";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", "UTF-16", 0);
  if (doc == NULL)
    fprintf(stdout, "Ex 2: failure; failed to parse valid document.\n");
  else
    fprintf(stdout, "Ex 2: success; parsed valid document.\n");

The funny thing is that libxml2 fails to parse this document when the
encoding, UTF-16, is correctly specified, but if the encoding argument
is NULL, the encoding is detected and the document is parsed correctly.

Attached patch 2 fixes UTF-16 decoding.



Next, not really a bug but a missing feature: UTF-32 can be
autodetected from its byte order mark, but libxml2 does not do that.
A solution is in patch 3.
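
For reference, the UTF-32 marks are four bytes long, so the check is
the same kind of thing as for UTF-16 (again only an illustration, not
the code in patch 3):

  /* Illustrative sketch of UTF-32 BOM detection, not the code in
   * patch 3. */
  typedef enum { UTF32_BOM_NONE, UTF32_BOM_BE, UTF32_BOM_LE } utf32Bom;

  static utf32Bom
  detectUTF32Bom(const unsigned char *buf, int len) {
      if (len >= 4) {
          if ((buf[0] == 0x00) && (buf[1] == 0x00) &&
              (buf[2] == 0xFE) && (buf[3] == 0xFF))
              return UTF32_BOM_BE;  /* 00 00 FE FF */
          if ((buf[0] == 0xFF) && (buf[1] == 0xFE) &&
              (buf[2] == 0x00) && (buf[3] == 0x00))
              return UTF32_BOM_LE;  /* FF FE 00 00 */
      }
      return UTF32_BOM_NONE;
  }

(FF FE 00 00 could in principle also be a UTF-16 little endian BOM
followed by U+0000, but a NUL is never legal XML content, so treating
it as UTF-32 is the sensible choice.)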



When parsing an encoding declaration or a text declaration, the
encoding variables are taken to mean something they do not mean, which
causes some problems.

First, assume the options XML_PARSE_IGNORE_ENC | XML_PARSE_DTDLOAD are
used. Let the document reference an external subset with a text
declaration, e.g. <?xml encoding="ascii"?>. Then we get the error
"Missing encoding in text declaration". This sort of makes sense (if
the existence of the declaration is ignored, it appears to be missing),
but it is probably not correct.

Example 3:

Let there be a file ext.xml with the content "<?xml encoding='ascii'?>".

  static const char content[] =
    "<!DOCTYPE root SYSTEM 'ext.xml'><root/>";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL,
                      XML_PARSE_IGNORE_ENC | XML_PARSE_DTDLOAD);
  if (doc == NULL)
    fprintf(stdout, "Ex 3: failure; failed to parse valid document.\n");
  else
    fprintf(stdout, "Ex 3: success; parsed valid document.\n");

Also, if the encoding declaration is ignored (either because the
declaration does not matter or because of the XML_PARSE_IGNORE_ENC
option), missing whitespace after it is not detected.

Example 4:

  // whitespace missing after 'UTF-8'
  static const char content[] =
    "<?xml version='1.0' encoding='UTF-8'standalone='yes'?><root/>";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 4: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 4: success; rejected an invalid document.\n");

Patch 4 addresses this bug.



The XML standard states that, in the absence of external encoding
information, it is a fatal error for a document which begins with
neither a BOM nor an encoding declaration to be in an encoding other
than UTF-8. libxml2 does not report this as it should.

Example 5:

  // UTF-16BE (no BOM) encoded '<?xml version="1.0"?><root/>'
  static const char content[] =
    "\x00<\x00?\x00x\x00m\x00l\x00 \x00v\x00" "e\x00r\x00s\x00i\x00o\x00n\x00=\x00\"\x00" "1\x00.\x00" 
"0\x00\"\x00?\x00>\x00<\x00r\x00o\x00o\x00t\x00/\x00>";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 5: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 5: success; rejected an invalid document.\n");

The standard also states that, in the absence of external encoding
information, it is a fatal error for the encoding declaration to name
an encoding other than the one the document actually uses. In several
cases this error is ignored.

Example 6. Document is in UTF-16 but claims to be in UTF-8.

  // UTF-16 encoded '<?xml version='1.0' encoding='utf-8'?><root />'
  static const char content[] =
    "\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00" "e\x00r\x00s\x00i\x00o\x00n\x00=\x00'\x00" "1\x00.\x00" 
"0\x00'\x00 \x00" "e\x00n\x00" "c\x00o\x00" "d\x00i\x00n\x00g\x00=\x00'\x00u\x00t\x00" "f\x00-\x00" 
"8\x00'\x00?\x00>\x00<\x00r\x00o\x00o\x00t\x00 \x00/\x00>\x00";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 6: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 6: success; rejected an invalid document.\n");


Example 7. Document is in little endian UTF-16 (i.e., has BOM) but
incorrectly claims to be in UTF-16LE (i.e., claims to not have a
BOM). (Alternative interpretation: document is in UTF-16LE as it
claims, but starts with U+FEFF character. Fatal error nonetheless.)


  // UTF-16 (with BOM) encoded '<?xml version='1.0' encoding='utf-16le'?><root/>'
  static const char content[] =
"\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00" "e\x00r\x00s\x00i\x00o\x00n\x00=\x00'\x00" "1\x00.\x00" 
"0\x00'\x00 \x00" "e\x00n\x00" "c\x00o\x00" "d\x00i\x00n\x00g\x00=\x00'\x00u\x00t\x00" "f\x00-\x00" "1\x00" 
"6\x00l\x00" "e\x00'\x00?\x00>\x00<\x00r\x00o\x00o\x00t\x00/\x00>\x00";
  int length = sizeof(content);
  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 7: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 7: success; rejected an invalid document.\n");

Example 8. The document may look like valid ascii, but because of the
byte order mark at the very beginning, it is not:

  static const char content[] =
    "\xef\xbb\xbf<?xml version='1.0' encoding='ascii'?><root/>";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 8: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 8: success; rejected an invalid document.\n");

Example 9. Change encoding on the fly, ascii -> utf-32.

  static const char content[] =
    "<?xml version='1.0' encoding='utf-32'"
    "\x00\x00\x00?\x00\x00\x00>\x00\x00\x00<\x00\x00\x00r\x00\x00\x00o"
    "\x00\x00\x00o\x00\x00\x00t\x00\x00\x00/\x00\x00\x00>";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 9: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 9: success; rejected an invalid document.\n");

Example 10. Change encoding on the fly, ascii -> cp424 (EBCDIC).

  static const char content[] =
    "<?xml version='1.0' encoding='cp424'onL\x99\x96\x96\xa3\x61n";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 10: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 10: success; rejected an invalid document.\n");

Example 11. Here we have a surrogate pair which is valid in UTF-16 but
invalid in UCS-2.

  // UTF-16 encoded '<?xml version="1.0" encoding="UCS-2"?><U+10000/>'
  static const char content[] =
    "\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00" "e\x00r\x00s\x00i\x00o\x00n\x00=\x00\"\x00" "1\x00.\x00" 
"0\x00\"\x00 \x00" "e\x00n\x00" "c\x00o\x00" "d\x00i\x00n\x00g\x00=\x00\"\x00U\x00" "C\x00S\x00-\x00" 
"2\x00\"\x00?\x00>\x00<\x00\x00\xd8\x00\xdc/\x00>\x00";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 11: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 11: success; rejected an invalid document.\n");

Patch 5 addresses these problems.



There is a small glitch where the input buffer is not grown at the
right time. This sometimes leads to errors; parsing the perfectly valid
document in the next example fails.

Example 12:

  // UTF-16LE encoded '<?xml  version = "1.0" encoding = "utf-16le"?><root/>'
  static const char content[] =
    "<\x00?\x00x\x00m\x00l\x00 \x00 \x00v\x00" "e\x00r\x00s\x00i\x00o\x00n\x00 \x00=\x00 \x00\"\x00" 
"1\x00.\x00" "0\x00\"\x00 \x00" "e\x00n\x00" "c\x00o\x00" "d\x00i\x00n\x00g\x00 \x00=\x00 
\x00\"\x00u\x00t\x00" "f\x00-\x00" "1\x00" "6\x00l\x00" 
"e\x00\"\x00?\x00>\x00<\x00r\x00o\x00o\x00t\x00/\x00>\x00";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc == NULL)
    fprintf(stdout, "Ex 12: failure; failed to parse valid document.\n");
  else
    fprintf(stdout, "Ex 12: success; parsed valid document.\n");

Patch 6 sets this right.



The XML standard allows an empty external entity, and an external
entity may start with a BOM. However, a BOM in an empty external entity
confuses libxml2, which assumes that a BOM can only occur in a string
of at least 4 bytes. A UTF-16 BOM causes parsing to fail, and a UTF-8
BOM is interpreted as a #xFEFF character.
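
There is no numbered example for this in bugdemo.c, but the UTF-16 case
can be demonstrated along these lines. Let there be a file empty.ent
(hypothetical, not among the attachments) whose entire content is a
UTF-16 little endian BOM, i.e. the two bytes 0xFF 0xFE; that is a
perfectly good empty external entity.

  static const char content[] =
    "<!DOCTYPE root [<!ENTITY e SYSTEM 'empty.ent'>]><root>&e;</root>";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, XML_PARSE_NOENT);
  if (doc == NULL)
    fprintf(stdout, "BOM-only entity: failure; failed to parse valid document.\n");
  else
    fprintf(stdout, "BOM-only entity: success; parsed valid document.\n");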

Patch 7 fixes these bugs, and also simplifies the code by avoiding
unnecessary copying of data.


Patch 8 simplifies some unnecessarily complicated encoding processing
in HTMLparser.c, and some minor things elsewhere.


Patch 9 implements the HTML 5 encoding detection algorithm, which is
more extensive and robust than the current encoding sniffing algorithm
in HTMLparser.c. For example, it ignores commented-out declarations. It
is not used by default, only when the parser options instruct it to.
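
To illustrate the kind of input where this matters (this fragment is
not part of bugdemo.c): the only charset declaration that should count
below is windows-1252; the one inside the comment is skipped by the
HTML 5 algorithm.

  // charset declaration hidden inside a comment, then the real one
  static const char content[] =
    "<!-- <meta charset='koi8-r'> -->"
    "<meta charset='windows-1252'><p>text</p>";
  int length = sizeof(content);

  htmlDocPtr doc;
  // the option added in patch 9 would go into the options argument;
  // with it the commented-out charset is skipped and windows-1252 wins
  doc = htmlReadMemory(content, length, "noname.html", NULL, 0);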



I can't say I really like the new code. It is convoluted and repeats
itself. However, without breaking backwards compatibility I could do no
better.

I'd welcome feedback especially about these questions:

Exactly when should we use ctxt->encoding and when ctxt->input->encoding?

xmlSwitchEncoding() in parserInternals.c assumes that input declared
as UTF-16LE or UTF-16BE might contain a UTF-8 BOM, "as we expect this
function to be called after xmlCharEncInFunc". Why? xmlCharEncInFunc()
seems to never be called. Also, if the input has already been decoded
(why else would the BOM be in UTF-8?), can xmlSwitchEncoding() just set
out to decode it again, as it does? Overall this just seems wrong, but
there may be something I missed.

Does UCS-2 have different encoding schemes, with and without BOM, like
UTF-16? How about UCS-4?

The XML standard says that UTF-16 must have a BOM. Should a missing
BOM be an XML_ERR_WARNING, an XML_ERR_ERROR, or something else?



Attached you will find the patches mentioned above, all the code
examples given above, and the ext.xml file used by Example 3.


Regards
 Olli Pottonen



Attachment: bugdemo.c
Description: Binary data

Attachment: ext.xml
Description: application/xml

Attachment: patch1.txt
Description: Text document

Attachment: patch2.txt
Description: Text document

Attachment: patch3.txt
Description: Text document

Attachment: patch4.txt
Description: Text document

Attachment: patch5.txt
Description: Text document

Attachment: patch6.txt
Description: Text document

Attachment: patch7.txt
Description: Text document

Attachment: patch8.txt
Description: Text document

Attachment: patch9.txt
Description: Text document


