Re: [xml] HTMLparser: UTF-8 byte order mark

On Thu, Dec 29, 2005 at 03:12:48PM +1100, Michael Day wrote:


  Sorry for the delay, I got caught up in end-of-year stuff and didn't
spend time at the computer :-)

> The HTMLparser chokes on HTML files that begin with a UTF-8 byte order
> mark. This is unfortunate, as files edited with Notepad can easily end up
> with a byte order mark at the start if saved with UTF-8 encoding.

  Hmm, I don't know how it should be processed in theory! In XML
the BOM is fine at the beginning of a document entity in UTF-8 or UTF-16
but will usually mess things up in different encodings. For HTML I don't
know what the theory suggests. For compatibility I guess the character
should be dropped if detected.

> Any tips on what would be the best way to handle this in HTMLparser.c?

  Probably in htmlParseDocument() in HTMLparser.c, create a separate function
to skip a BOM found there (static void htmlSkipBOM(htmlParserCtxtPtr ctxt)) and
call it just before this block:

    /* Wipe out everything which is before the first '<' */
    if (CUR == 0) {
        htmlParseErr(ctxt, XML_ERR_DOCUMENT_EMPTY,
                     "Document is empty\n", NULL, NULL);

  and in htmlParseTryOrFinish(), in the 'case XML_PARSER_START', call
the function just at the beginning, before

                /* Very first chars read from the document flow. */
                cur = in->cur[0];

  That sounds like the best approach, and I guess it should then work similarly
for the normal parser and the push one. Testing with a few documents in both
modes, with both UTF-8 and UTF-16, should ensure the approach is right :-)
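
  For illustration only, the UTF-8 check itself could look something like
the sketch below. It is deliberately standalone: skip_utf8_bom() is a
hypothetical helper working on a plain byte buffer, not an existing libxml2
function, and a real htmlSkipBOM() would instead look at the parser
context's current input and advance it. It covers only the three-byte UTF-8
mark (EF BB BF); UTF-16 would need separate handling.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of libxml2.
 * Returns the number of leading bytes to skip: 3 if the buffer starts
 * with the UTF-8 byte order mark (0xEF 0xBB 0xBF), 0 otherwise.
 * The length check keeps it safe on short or empty input, which matters
 * for the push parser where the first chunk may be tiny.
 */
static size_t
skip_utf8_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}
```

  The caller would add the returned count to its read position before the
"document is empty" test, so a file holding nothing but a BOM is still
reported as empty.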


Daniel Veillard      | Red Hat
veillard redhat com  | libxml GNOME XML XSLT toolkit | Rpmfind RPM search engine
