Re: [xml] HTMLparser: UTF-8 byte order mark



On Thu, Dec 29, 2005 at 03:12:48PM +1100, Michael Day wrote:

Hi,

  Sorry for the delay, I got caught up in end-of-year stuff and didn't
spend time at the computer :-)

The HTMLparser chokes on HTML files that begin with a UTF-8 byte order
mark. This is unfortunate, as files edited with Notepad can easily end up
with a byte order mark at the start if saved with UTF-8 encoding.

  Hmm, I don't know how it should be processed in theory! In XML
the BOM is fine at the beginning of a document entity in UTF-8 or UTF-16
but will usually mess things up in other encodings. For HTML I don't
know what the theory suggests. For compatibility I guess the character
should be dropped if detected.

Any tips on what would be the best way to handle this in HTMLparser.c?

  Probably in htmlParseDocument() in HTMLparser.c: create a separate function
to skip a BOM found there (static void htmlSkipBOM(htmlParserCtxtPtr ctxt)) and
call it just before this block:

-------------------------
    /*
     * Wipe out everything which is before the first '<'
     */
    SKIP_BLANKS;
    if (CUR == 0) {
        htmlParseErr(ctxt, XML_ERR_DOCUMENT_EMPTY,
                     "Document is empty\n", NULL, NULL);
    }
-------------------------
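
  A minimal sketch of the check such a function would need, written as a
standalone helper so it can be tried outside libxml2. The name
utf8_bom_length() is hypothetical and not part of the library; inside the
parser the same test would presumably run on the context's current input
pointer and advance it by the returned count.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of libxml2: returns the number of
 * leading bytes of buf that form a UTF-8 byte order mark (EF BB BF),
 * i.e. 3 if the mark is present and 0 otherwise.
 */
static size_t
utf8_bom_length(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}
```

  The caller skips that many bytes before the SKIP_BLANKS step, so a
document that starts with a BOM is treated the same as one without.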

  and in htmlParseTryOrFinish(), in the 'case XML_PARSER_START', call
the function just at the beginning, before

--------------------------
                /*
                 * Very first chars read from the document flow.
                 */
                cur = in->cur[0];
--------------------------
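
  In push mode the first chunk may hold fewer than three bytes, so the
check probably has to distinguish "no BOM" from "cannot tell yet". A
sketch under that assumption follows; the enum and function names are
made up for illustration and do not exist in libxml2.

```c
#include <stddef.h>

/* Hypothetical result type for a BOM check on partially received data. */
enum bom_status { BOM_ABSENT, BOM_PRESENT, BOM_NEED_MORE };

/*
 * Compare the bytes received so far against the UTF-8 BOM (EF BB BF).
 * If every available byte matches but fewer than three have arrived,
 * report BOM_NEED_MORE so the caller can wait for the next chunk.
 */
static enum bom_status
check_utf8_bom_prefix(const unsigned char *buf, size_t avail)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    size_t i;

    for (i = 0; i < 3; i++) {
        if (i >= avail)
            return BOM_NEED_MORE;   /* all bytes seen so far matched */
        if (buf[i] != bom[i])
            return BOM_ABSENT;
    }
    return BOM_PRESENT;
}
```

  On BOM_NEED_MORE the push parser would return and retry once more data
has been fed; on BOM_PRESENT it would skip three bytes and continue.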

  That sounds like the best approach, and I guess it should then work similarly
for the normal parser and the push one. Testing with a few documents in both
modes, with both UTF-8 and UTF-16, should ensure the approach is right :-)
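
  For building such test documents, it may help to have a small classifier
for the leading bytes. The sketch below recognises only the UTF-8 and
UTF-16 marks mentioned above (UTF-32 is deliberately ignored); the names
are hypothetical and independent of libxml2's own encoding detection.

```c
#include <stddef.h>

/* Hypothetical classification of the byte order mark at the start of a buffer. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE };

/*
 * Inspect the first bytes of buf:
 *   EF BB BF -> UTF-8
 *   FE FF    -> UTF-16, big endian
 *   FF FE    -> UTF-16, little endian
 */
static enum bom_kind
classify_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```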

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


