
Re: [xml] UTF-8 decoding bug in HTML parser



On Fri, Sep 26, 2008 at 02:44:19PM +1000, Michael Day wrote:
> Hi Daniel,
>
>>   See patch attached. I'm committing it to SVN as this fixes the specific
>> test case; all the errors seen when parsing subsequently look 'normal'
>> :-) so I added it to the test suite.
>
> Excellent!
>
> Would there be any chance that you could look at one more related issue
> affecting the HTML parser? Currently, if an HTML file begins with a UTF-8
> BOM, the HTML parser does not recognise it and parses it as three Latin1
> characters, which results in garbage at the beginning of the file and an
> incorrect encoding for the rest of the file.
>
> Would it be possible to skip over these three bytes, and ideally set the
> encoding to UTF-8 if they are present?

  Reusing the XML code for this seems to work fine for me and for the
regression test, but you probably have a more extensive HTML test
suite than I do ;-) so raise the problem if there is a regression!
Will commit to SVN with the test case,

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
Index: HTMLparser.c
===================================================================
--- HTMLparser.c	(revision 3797)
+++ HTMLparser.c	(working copy)
@@ -4120,6 +4120,8 @@ htmlParseElement(htmlParserCtxtPtr ctxt)
 
 int
 htmlParseDocument(htmlParserCtxtPtr ctxt) {
+    xmlChar start[4];
+    xmlCharEncoding enc;
     xmlDtdPtr dtd;
 
     xmlInitParser();
@@ -4139,6 +4141,23 @@ htmlParseDocument(htmlParserCtxtPtr ctxt
     if ((ctxt->sax) && (ctxt->sax->setDocumentLocator))
         ctxt->sax->setDocumentLocator(ctxt->userData, &xmlDefaultSAXLocator);
 
+    if ((ctxt->encoding == (const xmlChar *)XML_CHAR_ENCODING_NONE) &&
+        ((ctxt->input->end - ctxt->input->cur) >= 4)) {
+	/*
+	 * Get the 4 first bytes and decode the charset
+	 * if enc != XML_CHAR_ENCODING_NONE
+	 * plug some encoding conversion routines.
+	 */
+	start[0] = RAW;
+	start[1] = NXT(1);
+	start[2] = NXT(2);
+	start[3] = NXT(3);
+	enc = xmlDetectCharEncoding(&start[0], 4);
+	if (enc != XML_CHAR_ENCODING_NONE) {
+	    xmlSwitchEncoding(ctxt, enc);
+	}
+    }
+
     /*
      * Wipe out everything which is before the first '<'
      */
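
For readers unfamiliar with the byte sequence under discussion: the UTF-8
BOM is the three bytes EF BB BF at the very start of the input. The patch
itself leaves the detection to xmlDetectCharEncoding(); the standalone
sketch below only illustrates what such a check amounts to. The helper
name utf8_bom_length() is hypothetical and is not part of libxml2.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a libxml2 function): return the number of
 * leading bytes in buf that form a UTF-8 byte order mark, i.e. 3 if the
 * buffer starts with EF BB BF and 0 otherwise.
 */
static size_t
utf8_bom_length(const unsigned char *buf, size_t len)
{
    /* A NULL buffer is only acceptable when it is also empty. */
    assert(buf != NULL || len == 0);

    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}
```

A caller could advance its read pointer by the returned count and treat
the remaining input as UTF-8 when the count is non-zero, which is roughly
the behaviour requested in the quoted message.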

