Re: [xml] HTMLparser: SGML comments




Hi Daniel,

The invalid comment in wired.html is this:

    <!------TRADES--------->

Because it has an odd number of "--" sequences, the comment is not actually
terminated according to the SGML rules: the last "--" opens a new comment
instead of closing one.
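
To make the pairing visible, here is a minimal Python sketch of the SGML rule as I understand it: after "<!", each "--" opens a comment that runs to the next "--", and the declaration only ends at a ">" found outside any comment. The function name sgml_comment_bodies is made up for illustration and is not part of libxml.

```python
def sgml_comment_bodies(decl):
    """Split a comment declaration into its "-- ... --" comments.

    Returns (bodies, terminated). terminated is False when a "--" opens a
    comment that is never closed, or when the closing ">" is missing.
    """
    if not decl.startswith("<!"):
        raise ValueError("not a markup declaration")
    bodies = []
    i = 2  # first character after "<!"
    while decl.startswith("--", i):
        close = decl.find("--", i + 2)
        if close < 0:
            # Opening "--" with no partner: the rest is still comment text.
            bodies.append(decl[i + 2:])
            return bodies, False
        bodies.append(decl[i + 2:close])
        i = close + 2
        # Whitespace is allowed between one comment and the next.
        while i < len(decl) and decl[i].isspace():
            i += 1
    return bodies, decl.startswith(">", i)


print(sgml_comment_bodies("<!------TRADES--------->"))
# (['', 'TRADES', '', '->'], False)
print(sgml_comment_bodies("<!-- ok -->"))
# ([' ok '], True)
```

For the wired.html comment the fourth "--" pair has no partner, so the trailing "->" is still comment text and the declaration stays open.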

Web browsers will actually parse this comment differently depending on
whether they are using standards-mode or quirks-mode to parse the
document.

I have attached an HTML document that demonstrates the issue. If you open
it in Mozilla, it will be parsed in standards-mode because it has a
DOCTYPE declaration. In this case the comment will not be terminated and
some of the document text will be hidden. If you delete the DOCTYPE, it
will be parsed in quirks-mode: the comment will be terminated and the text
will be shown.

I cannot think of any way to detect comment termination that will handle
both cases correctly without adding a quirks-mode feature to the libxml
HTMLparser; there is no other way to parse old HTML and new HTML and get
them both right.

Would it be reasonable for me to add a quirks-mode flag to the HTML parser
that would only toggle comment parsing behaviour for now?
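
As a rough sketch of what I have in mind (in Python rather than C, and not libxml code), the flag would only select how the end of a comment is located. The name find_comment_end and the quirks parameter are hypothetical, and I am assuming that quirks mode simply ends the comment at the first "-->".

```python
def find_comment_end(text, start, quirks):
    """Return the index just past the comment that starts at text[start],
    or None if the comment is not terminated in the chosen mode."""
    if not text.startswith("<!--", start):
        raise ValueError("no comment at this position")
    if quirks:
        # Assumed quirks behaviour: the first "-->" closes the comment.
        end = text.find("-->", start + 4)
        return None if end < 0 else end + 3
    # Standards behaviour: "--" delimiters must pair up before the ">".
    i = start + 2
    while text.startswith("--", i):
        close = text.find("--", i + 2)
        if close < 0:
            return None
        i = close + 2
        while i < len(text) and text[i].isspace():
            i += 1
    return i + 1 if text.startswith(">", i) else None


doc = "<!------TRADES--------->hidden-->shown"
print(doc[find_comment_end(doc, 0, quirks=True):])   # hidden-->shown
print(doc[find_comment_end(doc, 0, quirks=False):])  # shown
```

With the flag off, the text up to the next "-->" is swallowed by the comment, as in standards-mode; with it on, the same text is kept.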

Cheers,

Michael

-- 
Print XML with Prince!
http://www.princexml.com

There will be text after this paragraph in quirks mode.

This will be hidden in standards-mode.-->

There will be text before this paragraph in quirks mode.


