Re: [xml] HTMLparser: SGML comments



On Tue, Nov 15, 2005 at 11:38:33AM +1100, Michael Day wrote:

Hi Daniel,

The invalid comment in wired.html is this:

    <!------TRADES--------->

Because it has an odd number of "--" sequences the comment is actually not
terminated according to the SGML rules.
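
As a rough illustration of the rule described above, here is a minimal
Python sketch. It is not libxml code, and the function name
sgml_comment_end is made up for this example. It assumes the strict SGML
reading: after "<!", each "--" opens a comment and the next "--" closes it,
and the declaration ends only at a ">" seen outside any comment.

```python
def sgml_comment_end(text, start=0):
    """Return the index just past the '>' that ends the comment declaration
    at text[start], or None if no well-formed end is found.

    Hypothetical helper for illustration only; not part of libxml.
    """
    if not text.startswith("<!--", start):
        raise ValueError("no comment declaration at this offset")
    pos = start + 2  # skip "<!"
    while pos < len(text):
        if text.startswith("--", pos):
            # "--" opens a comment; the next "--" closes it.
            close = text.find("--", pos + 2)
            if close == -1:
                return None  # comment opened but never closed
            pos = close + 2
        elif text[pos].isspace():
            pos += 1  # whitespace is allowed between comments
        elif text[pos] == ">":
            return pos + 1  # end of the declaration
        else:
            return None  # anything else is malformed under this reading
    return None


# Seven "--" pairs: the last one opens a comment that never closes.
print(sgml_comment_end("<!------TRADES--------->"))   # None
# One more "-" gives eight pairs, so the declaration terminates.
print(sgml_comment_end("<!------TRADES---------->"))  # 25
```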

Web browsers will actually parse this comment differently depending on
whether they are using standards-mode or quirks-mode to parse the
document.

I have attached an HTML document that demonstrates the issue. If you open
it in Mozilla, it will be parsed in standards-mode because it has a
DOCTYPE declaration. In this case the comment will not be terminated and
some of the document text will be hidden. If you delete the DOCTYPE it
will be parsed in quirks-mode, the comment will be terminated and the text
will be shown.
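
For comparison, the quirks-mode outcome can be approximated with a much
simpler rule: the comment ends at the first "-->" after the opening "<!--".
The sketch below is a simplification in Python, not what any browser or
libxml actually implements, and the name quirks_comment_end is invented for
illustration.

```python
def quirks_comment_end(text, start=0):
    """Return the index just past the first '-->' following '<!--',
    or None if there is none.

    Hypothetical helper; a simplified stand-in for quirks-mode behaviour.
    """
    if not text.startswith("<!--", start):
        raise ValueError("no comment at this offset")
    end = text.find("-->", start + 4)
    return None if end == -1 else end + 3


doc = "<!------TRADES---------><p>shown in quirks mode</p>"
end = quirks_comment_end(doc)
remaining = doc[end:]
print(remaining)  # <p>shown in quirks mode</p>
```

Under this rule the dash count is irrelevant, which is why the text after
the comment is shown rather than hidden.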

I cannot think of any way to detect comment termination that will handle
both cases correctly without adding a quirks-mode feature to the libxml
HTMLparser; there is no other way to parse old HTML and new HTML and get
them both right.

Would it be reasonable for me to add a quirks-mode flag to the HTML parser
that would only toggle comment parsing behaviour for now?

  I would just use the existing HTML_PARSE_RECOVER mode flag for this, though
in a sense I would have preferred the default behaviour to be maintained.
I really think that a wrong number of '-' in comments is a frequent
mistake, and even if SGML says the comment is not ended we should not miss
the start tag of the next element. The error is too benign and the effects
of the new code are too strong; this feels unbalanced, especially as it
is a change from the current behaviour.
  I don't know how best to handle this...

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


