Re: [xml] HTMLparser: UTF-8 byte order mark
- From: Daniel Veillard <veillard redhat com>
- To: Michael Day <mikeday yeslogic com>
- Cc: xml gnome org
- Subject: Re: [xml] HTMLparser: UTF-8 byte order mark
- Date: Mon, 2 Jan 2006 04:59:14 -0500
On Thu, Dec 29, 2005 at 03:12:48PM +1100, Michael Day wrote:
Hi,
Sorry for the delay, I got caught up in end-of-year stuff and didn't
spend time at the computer :-)
> The HTMLparser chokes on HTML files that begin with a UTF-8 byte order
> mark. This is unfortunate, as files edited with Notepad can easily end up
> with a byte order mark at the start if saved with UTF-8 encoding.
Hum, I don't know how it should be processed in theory! In XML
the BOM is fine at the beginning of a document entity in UTF-8 or UTF-16
but will usually mess things up in different encodings. For HTML I don't
know what the theory suggests. For compatibility I guess the character
should be dropped if detected.
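
For reference, the byte order marks discussed above are short fixed byte sequences at the very start of the input. The following is only a minimal sketch of how they could be told apart; the names bom_kind and detect_bom are hypothetical and are not part of libxml2, and UTF-32 marks are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a leading byte order mark. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE };

/*
 * Inspect the first bytes of buf and report which BOM, if any, is there.
 * Illustration only: UTF-32 marks are not handled, so FF FE 00 00 is
 * reported as UTF-16 little endian.
 */
enum bom_kind
detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* UTF-8 encoding of U+FEFF: EF BB BF */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    /* UTF-16 big endian: FE FF */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    /* UTF-16 little endian: FF FE */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```

A file saved by Notepad as UTF-8 would fall into the BOM_UTF8 case here, which is the one the HTML parser would need to drop.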
> Any tips on what would be the best way to handle this in HTMLparser.c?
Probably in htmlParseDocument() in HTMLparser.c, create a separate function
to skip a BOM found there (static htmlSkipBOM(htmlParserCtxtPtr ctxt)) and
call it just before this block:
-------------------------
    /*
     * Wipe out everything which is before the first '<'
     */
    SKIP_BLANKS;
    if (CUR == 0) {
        htmlParseErr(ctxt, XML_ERR_DOCUMENT_EMPTY,
                     "Document is empty\n", NULL, NULL);
    }
-------------------------
and in htmlParseTryOrFinish(), in the 'case XML_PARSER_START', call
the function just at the beginning, before
--------------------------
    /*
     * Very first chars read from the document flow.
     */
    cur = in->cur[0];
--------------------------
That sounds like the best approach, and I guess it should then work similarly
for the normal parser and the push one. Testing with a few documents in both
modes, with both UTF-8 and UTF-16, should ensure the approach is right :-)
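
The skipping step suggested above could look roughly like the sketch below. It is written against a hypothetical stand-in structure (input_cursor) rather than the real htmlParserCtxtPtr, so it shows only the byte check and the cursor advance, not how the function would be wired into htmlParseDocument() or htmlParseTryOrFinish().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the parser's input: not a libxml2 type. */
struct input_cursor {
    const unsigned char *cur;   /* next byte to be read */
    const unsigned char *end;   /* one past the last available byte */
};

/*
 * If the input starts with a UTF-8 byte order mark (EF BB BF), advance
 * past it and return 1; otherwise leave the cursor alone and return 0.
 * Fewer than 3 available bytes counts as "no BOM" in this sketch.
 */
int
skip_utf8_bom(struct input_cursor *in)
{
    assert(in != NULL && in->cur != NULL && in->cur <= in->end);

    if (in->end - in->cur >= 3 &&
        in->cur[0] == 0xEF && in->cur[1] == 0xBB && in->cur[2] == 0xBF) {
        in->cur += 3;
        return 1;
    }
    return 0;
}
```

In push mode the first chunk may hold fewer than three bytes, so a real implementation would presumably need to wait for more data before deciding; this sketch does not attempt that.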
Daniel
--
Daniel Veillard | Red Hat http://redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/