Hi,

HTMLparser currently parses comments by looking for a --> to end the
comment. However, this does not handle SGML comments, in which -- is
used to toggle whether > ends the comment.

It is possible for an SGML comment to look like this:

<!-- Hel>lo -- world --> good>bye -- world >

The whole thing is one comment, broken down like this:

"<!--"          starts the comment
" Hel>lo "      comment text ('>' is treated as text)
"--"            toggles state ('>' will end the comment)
" world "       comment text
"--"            toggles state ('>' will be treated as text)
"> good>bye "   comment text ('>' is treated as text)
"--"            toggles state ('>' will end the comment)
" world "       comment text
">"             ends the comment

This looks pretty scary, but this is how Mozilla handles HTML comments
in standards mode, and Opera is going to do the same. The Acid2 test
from the Web Standards Project includes an SGML comment:

http://www.webstandards.org/act/acid2/

For further info on SGML comments in HTML, see:

http://www.howtocreate.co.uk/SGMLComments.html

I have a patch for HTMLparser.c to make it parse SGML comments. It also
strips "--" from the text of the comment node, which is different from
the existing behaviour:

<!-- Hello -->
comment(" Hello ") // identical to old behaviour

<!-- Hello ---- world -->
comment(" Hello world ") // old behaviour includes "----"

<!-- Hello -- --> -- world >
comment(" Hello > world ")

Stripping out the "--" from the text of the comment node also makes it
possible to take documents that were parsed by HTMLparser and serialise
them as well-formed XML, which is sometimes not possible now.

Would this patch be acceptable?

Best regards,

Michael

--
Print XML with Prince!
http://www.princexml.com
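
The toggling rule described above can be sketched as a small standalone
C function. This is an illustration only: it is not the contents of
patch.c and does not use any libxml2 API. The names scan_sgml_comment
and gt_is_text are hypothetical, and the sketch assumes a NUL-terminated
input that begins with "<!--".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from patch.c or libxml2).
 *
 * Scans one SGML-style comment at the start of `in`. The comment text,
 * with every "--" delimiter stripped, is copied into `out`, which is
 * always NUL-terminated and silently truncated to fit `outlen` bytes.
 *
 * Returns the number of input bytes consumed, including the closing '>',
 * or -1 if `in` does not start with "<!--" or the comment never ends.
 */
static long scan_sgml_comment(const char *in, char *out, size_t outlen)
{
    size_t i = 4;        /* read position, just past "<!--" */
    size_t o = 0;        /* write position in out */
    int gt_is_text = 1;  /* right after "<!--", '>' is ordinary text */

    assert(in != NULL && out != NULL && outlen > 0);
    out[0] = '\0';

    if (strncmp(in, "<!--", 4) != 0)
        return -1;

    while (in[i] != '\0') {
        if (in[i] == '-' && in[i + 1] == '-') {
            gt_is_text = !gt_is_text;  /* "--" toggles the state ... */
            i += 2;                    /* ... and is not copied */
            continue;
        }
        if (in[i] == '>' && !gt_is_text) {
            out[o] = '\0';
            return (long)(i + 1);      /* this '>' ends the comment */
        }
        if (o + 1 < outlen)            /* keep room for the NUL */
            out[o++] = in[i];
        i++;
    }

    out[0] = '\0';                     /* input ended inside the comment */
    return -1;
}
```

Called on the example comment from this message, the function consumes
the whole comment in one pass and leaves the caller positioned just
after the final '>'. Whitespace on either side of a stripped "--" is
copied unchanged, so the spaces around each delimiter end up adjacent in
the output. A lone '-' that is not part of a "--" pair is treated as
ordinary text, which is a simplification made for this sketch.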
Attachment:
patch.c
Description: Text Data