Re: [xml] HTMLparser: SGML comments

On Wed, Nov 09, 2005 at 03:10:11PM +1100, Michael Day wrote:


HTMLparser currently parses comments by looking for a --> to end the
comment. However, this does not handle SGML comments, in which -- is used
to toggle whether > ends the comment. It is possible for an SGML comment
to look like this:

    <!-- Hel>lo -- world --> good>bye -- world >

The whole thing is one comment, broken down like this:

    "<!--"          starts the comment
    " Hel>lo "      comment text ('>' is treated as text)
    "--"            toggles state ('>' will end the comment)
    " world "       comment text
    "--"            toggles state ('>' will be treated as text)
    "> good>bye "   comment text ('>' is treated as text)
    "--"            toggles state ('>' will end the comment)
    " world "       comment text
    ">"             ends the comment

This looks pretty scary, but this is how Mozilla handles HTML comments in
standards mode and Opera is going to do the same. The Acid2 test from the
Web Standards Project includes an SGML comment:

For further info on SGML comments in HTML, see:

I have a patch for HTMLparser.c to make it parse SGML comments. It also
strips "--" from the text of the comment node, which is different from the
existing behaviour:

    <!-- Hello -->
    comment(" Hello ")                // identical to old behaviour

    <!-- Hello ---- world -->
    comment(" Hello  world ") // old behaviour includes "----"

    <!-- Hello -- --> -- world >
    comment(" Hello  >  world ")
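
For illustration only, the toggle rule and the stripping shown in these examples can be sketched as a small standalone scanner. This is not the patch itself: the function name `sgml_comment_scan` and its buffer-based interface are hypothetical, and it works on a NUL-terminated string rather than on libxml2's parser input.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libxml2: scan an SGML-style comment at
 * the start of `in` and copy its text, with every "--" stripped, into `out`.
 * Returns the number of input bytes consumed, or 0 if `in` does not start
 * with "<!--", the comment is unterminated, or `out` is too small.
 */
size_t sgml_comment_scan(const char *in, char *out, size_t outsz)
{
    size_t i = 4;        /* read position, just past "<!--" */
    size_t o = 0;        /* write position in out */
    int gt_is_text = 1;  /* right after "<!--", '>' is ordinary text */

    if (outsz == 0 || strncmp(in, "<!--", 4) != 0)
        return 0;

    while (in[i] != '\0') {
        if (in[i] == '-' && in[i + 1] == '-') {
            gt_is_text = !gt_is_text;  /* "--" toggles the state */
            i += 2;
        } else if (in[i] == '>' && !gt_is_text) {
            out[o] = '\0';
            return i + 1;              /* this '>' ends the comment */
        } else {
            if (o + 1 >= outsz)
                return 0;              /* no room for this char plus NUL */
            out[o++] = in[i++];
        }
    }
    return 0;                          /* ran off the end: unterminated */
}
```

Called on `<!-- Hello -- --> -- world >`, this sketch consumes the whole string and leaves `" Hello  >  world "` in `out`, matching the third example above.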

Stripping out the "--" from the text of the comment node also makes it
possible to take documents that were parsed by HTMLparser and serialise
them as well-formed XML, which is sometimes not possible now.
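
The XML constraint behind this is that comment text may not contain "--" and may not end in "-". A minimal sketch of that check follows; the function name `xml_comment_text_ok` is hypothetical and is not a libxml2 API.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if `text` could be written between
 * "<!--" and "-->" in well-formed XML, 0 otherwise.
 */
int xml_comment_text_ok(const char *text)
{
    size_t len = strlen(text);

    if (strstr(text, "--") != NULL)
        return 0;                      /* "--" is forbidden inside a comment */
    if (len > 0 && text[len - 1] == '-')
        return 0;                      /* would serialise as "--->" */
    return 1;
}
```

With the old behaviour, the text `" Hello ---- world "` fails this check; with the "--" stripped, `" Hello  world "` passes.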

Would this patch be acceptable?

  Sounds like a good idea to fix the parser behaviour to be more correct, yes.
I don't really know SGML, so such patches are welcome. I just have one
problem with the code: it calls GROW only when the end of the buffer is
detected with a NUL. I would rather have it called more preemptively
in the loop to avoid a potential weakness in the case of multibyte chars.
  Note also that I prefer patches to cut and paste of full routines, as
they give me the context of what was changed.

    thanks !


Daniel Veillard      | Red Hat
veillard redhat com  | libxml GNOME XML XSLT toolkit | Rpmfind RPM search engine
