[xml] HTMLparser: SGML comments




Hi,

HTMLparser currently ends a comment at the first "-->". However, this
does not handle SGML comments, in which each "--" toggles whether a ">"
ends the comment. It is possible for an SGML comment to look like this:

    <!-- Hel>lo -- world --> good>bye -- world >

The whole thing is one comment, broken down like this:

    "<!--"          starts the comment
    " Hel>lo "      comment text ('>' is treated as text)
    "--"            toggles state ('>' will end the comment)
    " world "       comment text
    "--"            toggles state ('>' will be treated as text)
    "> good>bye "   comment text ('>' is treated as text)
    "--"            toggles state ('>' will end the comment)
    " world "       comment text
    ">"             ends the comment

This looks pretty scary, but it is how Mozilla handles HTML comments in
standards mode, and Opera is going to do the same. The Acid2 test from the
Web Standards Project includes an SGML comment:

    http://www.webstandards.org/act/acid2/

For further info on SGML comments in HTML, see:

    http://www.howtocreate.co.uk/SGMLComments.html

I have a patch for HTMLparser.c to make it parse SGML comments. It also
strips "--" from the text of the comment node, which is different from the
existing behaviour:

    <!-- Hello -->
    comment(" Hello ")          // identical to old behaviour

    <!-- Hello ---- world -->
    comment(" Hello  world ")   // old behaviour includes "----"

    <!-- Hello -- --> -- world >
    comment(" Hello  >  world ")
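
For illustration only, the toggle rule and the "--" stripping can be
sketched as a small standalone C function. This is not the attached patch
and it does not use any libxml2 internals: the name sgml_comment_scan and
its signature are hypothetical, and it assumes a NUL-terminated input that
starts with "<!--".

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from libxml2 or the attached patch).
 * `in` must start with "<!--". Scans the comment using the "--" toggle
 * rule and copies the comment text, minus every "--" delimiter, into
 * `out` (capacity `cap`, NUL-terminated on success).
 * Returns the number of input bytes consumed, including the final '>',
 * or 0 if `in` is not a comment, is unterminated, or `out` is too small.
 */
size_t sgml_comment_scan(const char *in, char *out, size_t cap)
{
    size_t i = 4;           /* skip "<!--" */
    size_t n = 0;           /* bytes written to out */
    int gt_is_text = 1;     /* right after "<!--", '>' is plain text */

    if (cap == 0 || strncmp(in, "<!--", 4) != 0)
        return 0;

    while (in[i] != '\0') {
        if (in[i] == '-' && in[i + 1] == '-') {
            gt_is_text = !gt_is_text;   /* "--" toggles the state... */
            i += 2;                     /* ...and is dropped from the text */
            continue;
        }
        if (in[i] == '>' && !gt_is_text) {
            out[n] = '\0';
            return i + 1;               /* '>' ends the comment */
        }
        if (n + 1 >= cap)
            return 0;                   /* no room left for text + NUL */
        out[n++] = in[i++];
    }
    return 0;   /* ran out of input before the closing '>' */
}
```

Called on "<!-- Hello -- --> -- world >", this sketch consumes the whole
string and leaves " Hello  >  world " in the output buffer, matching the
third example above.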

Stripping out the "--" from the text of the comment node also makes it
possible to take documents that were parsed by HTMLparser and serialise
them as well-formed XML, which is sometimes not possible now.
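
The XML constraint behind this is that XML 1.0 does not allow "--" inside
a comment, and the comment text may not end with "-". A minimal sketch of
that check follows; xml_comment_text_ok is a hypothetical name, not a
libxml2 function, and it only covers the dash rules, not XML's restrictions
on character ranges.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of libxml2).
 * Returns 1 if `text` could be written between "<!--" and "-->" in an
 * XML 1.0 document as far as the dash rules go, 0 otherwise.
 */
int xml_comment_text_ok(const char *text)
{
    size_t len = strlen(text);

    if (strstr(text, "--") != NULL)
        return 0;                       /* "--" is forbidden inside a comment */
    if (len > 0 && text[len - 1] == '-')
        return 0;                       /* would serialise as "--->" */
    return 1;
}
```

For the second example above, the old comment text " Hello ---- world "
fails this check, while the stripped text " Hello  world " passes.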

Would this patch be acceptable?

Best regards,

Michael

-- 
Print XML with Prince!
http://www.princexml.com

Attachment: patch.c
Description: Text Data


