On Fri, Sep 26, 2008 at 02:44:19PM +1000, Michael Day wrote:
> Hi Daniel,
>
>> See patch attached, I'm committing it to SVN as this fixes the specific
>> test case, all the errors seen when parsing subsequently look 'normal' :-)
>> so I added it to the test suite
>
> Excellent! Would there be any chance that you could look at one more
> related issue affecting the HTML parser? Currently if an HTML file begins
> with a UTF-8 BOM, the HTML parser does not recognise it and parses it as
> three Latin1 characters, which results in garbage at the beginning of the
> file and an incorrect encoding for the rest of the file. Would it be
> possible to skip over these three bytes, and ideally set the encoding to
> UTF-8 if they are present?

  Reusing the XML code for this seems to work fine for me and the
regression test, but you probably have a more extensive HTML test suite
than me ;-) so raise the problem if there is a regression!
  Will commit to SVN with the test case,

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine  http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
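
For readers following the thread, here is a minimal sketch of the check being discussed: detect the three-byte UTF-8 BOM (EF BB BF) at the start of a buffer, skip it, and treat the rest as UTF-8. This is not the content of the attached patch and does not use libxml2; the helper names `utf8_bom_length` and `guess_encoding` are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not a libxml2 function.
 * Returns the number of leading bytes to skip: 3 if the buffer starts
 * with the UTF-8 BOM (0xEF 0xBB 0xBF), otherwise 0.
 */
static size_t utf8_bom_length(const unsigned char *buf, size_t len)
{
    if (buf != NULL && len >= 3 &&
        buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}

/*
 * Hypothetical helper: chooses an encoding name for the input.
 * If a BOM is present the input is assumed to be UTF-8; otherwise the
 * caller-supplied fallback (for example "ISO-8859-1") is returned.
 */
static const char *guess_encoding(const unsigned char *buf, size_t len,
                                  const char *fallback)
{
    return utf8_bom_length(buf, len) == 3 ? "UTF-8" : fallback;
}
```

A caller would advance its read pointer by the value returned from `utf8_bom_length` before handing the data to the parser, so the BOM bytes are never interpreted as three Latin1 characters.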
Attachment:
html_utf8_bom.patch
Description: Text document