Re: [xml] UTF-8 decoding bug in HTML parser
- From: Daniel Veillard <veillard redhat com>
- To: Michael Day <mikeday yeslogic com>
- Cc: xml gnome org
- Subject: Re: [xml] UTF-8 decoding bug in HTML parser
- Date: Fri, 26 Sep 2008 12:06:22 +0200
On Fri, Sep 26, 2008 at 02:44:19PM +1000, Michael Day wrote:
> Hi Daniel,
>
>> See patch attached, I'm committing it to SVN as this fixes the specific
>> test case, all the errors seen when parsing subsequently look 'normal'
>> :-) so I added it to the test suite
>
> Excellent!
>
> Would there be any chance that you could look at one more related issue
> affecting the HTML parser? Currently if an HTML file begins with a UTF-8
> BOM, the HTML parser does not recognise it and parses it as three Latin1
> characters, which results in garbage at the beginning of the file and an
> incorrect encoding for the rest of the file.
>
> Would it be possible to skip over these three bytes, and ideally set the
> encoding to UTF-8 if they are present?

Reusing the XML code for this seems to work fine for me and for the
regression tests, but you probably have a more extensive HTML test
suite than I do ;-) so raise the problem if there is a regression!
Will commit to SVN with the test case,
Daniel
--
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
daniel veillard com | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/
Index: HTMLparser.c
===================================================================
--- HTMLparser.c (revision 3797)
+++ HTMLparser.c (working copy)
@@ -4120,6 +4120,8 @@ htmlParseElement(htmlParserCtxtPtr ctxt)
 
 int
 htmlParseDocument(htmlParserCtxtPtr ctxt) {
+    xmlChar start[4];
+    xmlCharEncoding enc;
     xmlDtdPtr dtd;
 
     xmlInitParser();
@@ -4139,6 +4141,23 @@ htmlParseDocument(htmlParserCtxtPtr ctxt
     if ((ctxt->sax) && (ctxt->sax->setDocumentLocator))
         ctxt->sax->setDocumentLocator(ctxt->userData, &xmlDefaultSAXLocator);
 
+    if ((ctxt->encoding == (const xmlChar *)XML_CHAR_ENCODING_NONE) &&
+        ((ctxt->input->end - ctxt->input->cur) >= 4)) {
+        /*
+         * Get the 4 first bytes and decode the charset
+         * if enc != XML_CHAR_ENCODING_NONE
+         * plug some encoding conversion routines.
+         */
+        start[0] = RAW;
+        start[1] = NXT(1);
+        start[2] = NXT(2);
+        start[3] = NXT(3);
+        enc = xmlDetectCharEncoding(&start[0], 4);
+        if (enc != XML_CHAR_ENCODING_NONE) {
+            xmlSwitchEncoding(ctxt, enc);
+        }
+    }
+
     /*
      * Wipe out everything which is before the first '<'
      */
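
For readers following the thread, the check requested above (skip the three BOM bytes and treat the rest as UTF-8) can be sketched on its own as follows. This is a minimal stand-alone illustration and not libxml2 code: the function name sketch_utf8_bom_length is hypothetical, and it only recognises the UTF-8 BOM (bytes EF BB BF), whereas the patch hands the first four bytes to xmlDetectCharEncoding().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper for illustration only (not part of libxml2).
 * Returns 3 if the buffer starts with the UTF-8 byte order mark
 * EF BB BF, and 0 otherwise, including when fewer than 3 bytes
 * are available.
 */
size_t
sketch_utf8_bom_length(const unsigned char *buf, size_t len) {
    assert(buf != NULL);

    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}
```

A caller would advance its read pointer by the returned length and, when the result is 3, decode the remaining input as UTF-8 instead of Latin1.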