Hi,

I was playing around with lxml and noticed that it sometimes fails to decode UTF-16. So I investigated, experimented, and ended up writing patches for more than just the UTF-16 problem.

First problem: a missing space after '<!DOCTYPE' is a fatal error which should be reported, but it is not.

Example 1:

    static const char content[] = "<!DOCTYPEroot><root/>";
    int length = sizeof(content);
    xmlDocPtr doc;

    doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
    if (doc != NULL)
        fprintf(stdout, "Ex 1: failure; accepted an invalid document.\n");
    else
        fprintf(stdout, "Ex 1: success; rejected an invalid document.\n");

You will find a solution in attached patch 1.

Section 4.3.3 of the XML 1.0 standard states that XML processors must be able to read entities in UTF-16. Section 3.10 of the Unicode standard specifies how UTF-16 is read: the serialisation order (little endian vs. big endian) is detected from a leading byte order mark (BOM). However, libxml2 fails to read the mark and assumes that UTF-16 is always little endian. The Unicode standard is available at http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#page=43

Example 2:

    // UTF-16 (big endian) encoded '<root/>'
    static const char content[] = "\xfe\xff\000<\000r\000o\000o\000t\000/\000>";
    int length = sizeof(content);
    xmlDocPtr doc;

    doc = xmlReadMemory(content, length, "noname.xml", "UTF-16", 0);
    if (doc == NULL)
        fprintf(stdout, "Ex 2: failure; failed to parse valid document.\n");
    else
        fprintf(stdout, "Ex 2: success; parsed valid document.\n");

The funny thing is that libxml2 fails to parse this document when the encoding, UTF-16, is correctly specified, but if the encoding argument is NULL, the encoding is detected and the document is parsed correctly. Attached patch 2 fixes UTF-16 decoding.

Next, not really a bug but a missing feature: UTF-32 can be autodetected from a byte order mark, but libxml2 does not do that. Solution in patch 3. (A minimal sketch of this kind of BOM detection follows Example 4 below.)

When parsing an encoding declaration or text declaration, the encoding variables are taken to mean something they do not mean, which causes some problems. First, assume the options XML_PARSE_IGNORE_ENC | XML_PARSE_DTDLOAD are used, and let the document reference an external subset with a text declaration, e.g. <?xml encoding="ascii"?>. Then we get the error "Missing encoding in text declaration". This sort of makes sense (if the existence of the declaration is ignored, it seems to be missing) but is probably not correct.

Example 3: Let there be a file ext.xml with content "<?xml encoding='ascii'?>".

    static const char content[] = "<!DOCTYPE root SYSTEM 'ext.xml'><root/>";
    int length = sizeof(content);
    xmlDocPtr doc;

    doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
    if (doc == NULL)
        fprintf(stdout, "Ex 3: failure; failed to parse valid document.\n");
    else
        fprintf(stdout, "Ex 3: success; parsed valid document.\n");

Also, if the encoding declaration is ignored (either because the declaration does not matter, or because of the XML_PARSE_IGNORE_ENC option), missing whitespace after it is not detected.

Example 4:

    // whitespace missing after 'UTF-8'
    static const char content[] =
        "<?xml version='1.0' encoding='UTF-8'standalone='yes'?><root/>";
    int length = sizeof(content);
    xmlDocPtr doc;

    doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
    if (doc != NULL)
        fprintf(stdout, "Ex 4: failure; accepted an invalid document.\n");
    else
        fprintf(stdout, "Ex 4: success; rejected an invalid document.\n");

Patch 4 addresses this bug.
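As promised above, here is a minimal sketch of the kind of BOM-based detection patches 2 and 3 aim at. The enum and function names are mine, for illustration only; this is not code from the patches:

    #include <stddef.h>

    typedef enum {
        ENC_UNKNOWN, ENC_UTF16LE, ENC_UTF16BE, ENC_UTF32LE, ENC_UTF32BE
    } bom_enc;

    static bom_enc
    detect_bom(const unsigned char *b, size_t len)
    {
        /* Check the 4-byte UTF-32 marks first: FF FE 00 00 would
         * otherwise be misread as a UTF-16LE BOM followed by a NUL. */
        if (len >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
            return ENC_UTF32LE;
        if (len >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
            return ENC_UTF32BE;
        if (len >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return ENC_UTF16BE;
        if (len >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return ENC_UTF16LE;
        return ENC_UNKNOWN;
    }

With detection like this, the document in Example 2 would be read as big endian regardless of whether the caller passes "UTF-16" or NULL as the encoding argument.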
The XML standard states that, in the absence of an external encoding declaration and a BOM, it is a fatal error for a document not to be in UTF-8. This is not reported as it should be.

Example 5:

    // UTF-16BE (no BOM) encoded '<?xml version="1.0"?><root/>'
    static const char content[] =
        "\x00<\x00?\x00x\x00m\x00l\x00 \x00v\x00"
        "e\x00r\x00s\x00i\x00o\x00n\x00=\x00\"\x00"
        "1\x00.\x00"
        "0\x00\"\x00?\x00>\x00<\x00r\x00o\x00o\x00t\x00/\x00>";
    int length = sizeof(content);
    xmlDocPtr doc;

    doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
    if (doc != NULL)
        fprintf(stdout, "Ex 5: failure; accepted an invalid document.\n");
    else
        fprintf(stdout, "Ex 5: success; rejected an invalid document.\n");

The standard also states that, in the absence of an external encoding declaration, it is a fatal error for the XML declaration to claim that the document is in an encoding which it does not actually use. In several cases this error is ignored.

Example 6. The document is in UTF-16 but claims to be in UTF-8.

    // UTF-16 encoded '<?xml version='1.0' encoding='utf-8'?><root />'
    static const char content[] =
        "\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00"
        "e\x00r\x00s\x00i\x00o\x00n\x00=\x00'\x00"
        "1\x00.\x00"
        "0\x00'\x00 \x00"
        "e\x00n\x00"
        "c\x00o\x00"
        "d\x00i\x00n\x00g\x00=\x00'\x00u\x00t\x00"
        "f\x00-\x00"
        "8\x00'\x00?\x00>\x00<\x00r\x00o\x00o\x00t\x00 \x00/\x00>\x00";
    int length = sizeof(content);
    xmlDocPtr doc;

    doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
    if (doc != NULL)
        fprintf(stdout, "Ex 6: failure; accepted an invalid document.\n");
    else
        fprintf(stdout, "Ex 6: success; rejected an invalid document.\n");

Example 7. The document is in little endian UTF-16 (i.e., has a BOM) but incorrectly claims to be in UTF-16LE (i.e., claims to not have a BOM). (Alternative interpretation: the document is in UTF-16LE as it claims, but starts with a U+FEFF character. A fatal error nonetheless.)

    // UTF-16 (with BOM) '<?xml version='1.0' encoding='utf-16le'?><root/>'
    static const char content[] =
        "\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00"
        "e\x00r\x00s\x00i\x00o\x00n\x00=\x00'\x00"
        "1\x00.\x00"
        "0\x00'\x00 \x00"
        "e\x00n\x00"
        "c\x00o\x00"
        "d\x00i\x00n\x00g\x00=\x00'\x00u\x00t\x00"
        "f\x00-\x00"
        "1\x00"
        "6\x00l\x00"
        "e\x00'\x00?\x00>\x00<\x00r\x00o\x00o\x00t\x00/\x00>\x00";
    int length = sizeof(content);
    xmlDocPtr doc;

    doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
    if (doc != NULL)
        fprintf(stdout, "Ex 7: failure; accepted an invalid document.\n");
    else
        fprintf(stdout, "Ex 7: success; rejected an invalid document.\n");

Example 8. The document may look like valid ASCII, but because of the byte order mark at the very beginning, it is not:

    static const char content[] =
        "\xef\xbb\xbf<?xml version='1.0' encoding='ascii'?><root/>";
    int length = sizeof(content);
    xmlDocPtr doc;

    doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
    if (doc != NULL)
        fprintf(stdout, "Ex 8: failure; accepted an invalid document.\n");
    else
        fprintf(stdout, "Ex 8: success; rejected an invalid document.\n");

Example 9. Change encoding on the fly, ascii -> utf-32.

    static const char content[] =
        "<?xml version='1.0' encoding='utf-32'"
        "\x00\x00\x00?\x00\x00\x00>\x00\x00\x00<\x00\x00\x00r"
        "\x00\x00\x00o\x00\x00\x00o\x00\x00\x00t\x00\x00\x00/"
        "\x00\x00\x00>";
    int length = sizeof(content);
    xmlDocPtr doc;

    doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
    if (doc != NULL)
        fprintf(stdout, "Ex 9: failure; accepted an invalid document.\n");
    else
        fprintf(stdout, "Ex 9: success; rejected an invalid document.\n");
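Example 9 above and Example 10 below are two instances of the same rule: if the XML declaration itself was readable as ASCII, the declared encoding must be ASCII-compatible. A hypothetical sketch of such a check (the function and the name list are mine, not patch code):

    #include <strings.h>  /* strcasecmp (POSIX) */

    /* Reject declared encodings that could not have produced an
     * ASCII-readable XML declaration: encodings with multi-byte code
     * units (UTF-16, UTF-32, UCS-2, UCS-4) and EBCDIC code pages such
     * as cp424.  Only meaningful when the declaration was in fact
     * parsed byte-per-character as ASCII. */
    static int
    declared_encoding_is_plausible(const char *declared)
    {
        static const char *const not_ascii_compatible[] = {
            "utf-16", "utf-16le", "utf-16be",
            "utf-32", "utf-32le", "utf-32be",
            "ucs-2", "ucs-4", "cp424", NULL
        };
        int i;

        for (i = 0; not_ascii_compatible[i] != NULL; i++)
            if (strcasecmp(declared, not_ascii_compatible[i]) == 0)
                return 0;  /* the declaration contradicts itself */
        return 1;
    }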
Example 10. Change encoding on the fly, ascii -> cp424 (EBCDIC).

    static const char content[] =
        "<?xml version='1.0' encoding='cp424'onL\x99\x96\x96\xa3\x61n";
    int length = sizeof(content);
    xmlDocPtr doc;

    doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
    if (doc != NULL)
        fprintf(stdout, "Ex 10: failure; accepted an invalid document.\n");
    else
        fprintf(stdout, "Ex 10: success; rejected an invalid document.\n");

Example 11. Here we have a surrogate pair which is valid in UTF-16 but invalid in UCS-2.

    // UTF-16 encoded '<?xml version="1.0" encoding="UCS-2"?><U+10000/>'
    static const char content[] =
        "\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00"
        "e\x00r\x00s\x00i\x00o\x00n\x00=\x00\"\x00"
        "1\x00.\x00"
        "0\x00\"\x00 \x00"
        "e\x00n\x00"
        "c\x00o\x00"
        "d\x00i\x00n\x00g\x00=\x00\"\x00U\x00"
        "C\x00S\x00-\x00"
        "2\x00\"\x00?\x00>\x00<\x00\x00\xd8\x00\xdc/\x00>\x00";
    int length = sizeof(content);
    xmlDocPtr doc;

    doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
    if (doc != NULL)
        fprintf(stdout, "Ex 11: failure; accepted an invalid document.\n");
    else
        fprintf(stdout, "Ex 11: success; rejected an invalid document.\n");

Patch 5 addresses these problems.

There is a small glitch of not growing the input buffer at the right time. This sometimes leads to errors; parsing the perfectly valid document in the next example fails.

Example 12:

    // UTF-16LE encoded '<?xml version = "1.0" encoding = "utf-16le"?><root/>'
    static const char content[] =
        "<\x00?\x00x\x00m\x00l\x00 \x00 \x00v\x00"
        "e\x00r\x00s\x00i\x00o\x00n\x00 \x00=\x00 \x00\"\x00"
        "1\x00.\x00"
        "0\x00\"\x00 \x00"
        "e\x00n\x00"
        "c\x00o\x00"
        "d\x00i\x00n\x00g\x00 \x00=\x00 \x00\"\x00u\x00t\x00"
        "f\x00-\x00"
        "1\x00"
        "6\x00l\x00"
        "e\x00\"\x00?\x00>\x00<\x00r\x00o\x00o\x00t\x00/\x00>\x00";
    int length = sizeof(content);
    xmlDocPtr doc;

    doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
    if (doc == NULL)
        fprintf(stdout, "Ex 12: failure; failed to parse valid document.\n");
    else
        fprintf(stdout, "Ex 12: success; parsed valid document.\n");

Patch 6 sets this right.

The XML standard allows an empty external entity, and an external entity may start with a BOM. However, a BOM in an empty external entity confuses libxml2, which assumes that a BOM may only occur in a string of at least 4 bytes. A UTF-16 BOM causes parsing to fail, and a UTF-8 BOM is interpreted as a #xFEFF character. Patch 7 fixes these bugs, and also simplifies the code by avoiding unnecessary copying of data.

Patch 8 simplifies some unnecessarily complicated encoding processing in HTMLparser.c and some minor things elsewhere.

Patch 9 implements the HTML5 encoding detection algorithm, which is more extensive and robust than the current encoding sniffing algorithm in HTMLparser.c. For example, it ignores commented-out declarations (see the sketch below). It is not used by default, only when the options instruct to do so.
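To illustrate the difference, here is a toy version of the comment-skipping part of the HTML5 prescan. This is a simplification of the idea, not code from patch 9; the real algorithm also handles http-equiv, quoting, attribute order, and much more:

    #include <string.h>
    #include <strings.h>  /* strncasecmp (POSIX) */

    /* Scan a byte buffer for a <meta charset=...> declaration, skipping
     * comments first so that a commented-out declaration is ignored,
     * unlike with the current sniffer.  Returns a pointer to the start
     * of the encoding name, or NULL if none is found. */
    static const char *
    sniff_meta_charset(const char *buf, size_t len)
    {
        const char *p = buf;
        const char *end = buf + len;

        while (p < end) {
            if (end - p >= 4 && memcmp(p, "<!--", 4) == 0) {
                p += 4;                 /* skip the whole comment */
                while (end - p >= 3 && memcmp(p, "-->", 3) != 0)
                    p++;
                if (end - p < 3)
                    return NULL;        /* unterminated comment */
                p += 3;
            } else if (end - p >= 14 &&
                       strncasecmp(p, "<meta charset=", 14) == 0) {
                return p + 14;
            } else {
                p++;
            }
        }
        return NULL;
    }

For input like "<!--<meta charset=utf-8>--><meta charset=windows-1252>", this returns a pointer at "windows-1252", whereas a naive substring search would stop inside the comment.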
I can't say I really like the new code. It is convoluted and repeats itself. However, I could do no better without breaking backwards compatibility. I'd welcome feedback, especially about these questions:

- Exactly when should we use ctxt->encoding, and when ctxt->input->encoding?

- xmlSwitchEncoding() in parserInternals.c assumes that input in UTF-16LE or UTF-16BE might contain a UTF-8 BOM, "As we expect this function to be called after xmlCharEncInFunc". Why? xmlCharEncInFunc() seems to never be called. Also, if the input has already been decoded (why else would the BOM be in UTF-8?), can xmlSwitchEncoding() just set out to decode it again, as it does? Overall this just seems wrong, but there may be something I missed.

- Does UCS-2 have different schemes, with/without BOM, like UTF-16? How about UCS-4?

- The XML standard says that UTF-16 must have a BOM. Should a missing BOM be an XML_ERR_WARNING, an XML_ERR_ERROR, or something else?

Attached you will find the patches mentioned above, all the code examples given above, and an example XML file associated with one of the examples.

Regards,
Olli Pottonen
Attachments:
- bugdemo.c
- ext.xml
- patch1.txt
- patch2.txt
- patch3.txt
- patch4.txt
- patch5.txt
- patch6.txt
- patch7.txt
- patch8.txt
- patch9.txt