Re: [xml] HTMLparser comment parsing bug and patch



On Wed, 30 Jul 2003, Daniel Veillard wrote:

  Anyway William Brake came up with another patch which seems correct
and he was able to reproduce the problem and fix it.

I don't recollect seeing that.  His first patch doesn't fix it.

<p>In this (valid) HTML paragraph,
<!-- this is a comment -- but this is outside the comment

  so -- is interpreted as a comment ending tag in this context

It ends the comment, but not the comment declaration.


--> and this
is another comment

  so --> is interpreted as a comment start ???

Not quite.  "--" within a comment declaration starts another comment,
and the ">" is just a character within the comment.

--> and the second "-->" ends the comment declaration.

That one's what it looks like:-)
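
For anyone who wants to see those rules mechanically, here is a minimal
sketch in Python.  It is not libxml's code and not a full SGML parser: the
function name scan_comment_declaration and the decision to collect text
found between comments (rather than reject it) are assumptions made purely
for illustration.  It only encodes the reading above: inside "<!" ... ">",
each "--" toggles comment state, and ">" closes the declaration only when
no comment is open.

```python
def scan_comment_declaration(text, start=0):
    """Scan one SGML-style comment declaration starting at text[start].

    Returns (comments, stray, end):
      comments - bodies of each "-- ... --" comment in the declaration
      stray    - non-blank text found between comments (collected, not rejected)
      end      - index just past the ">" that closes the declaration
    """
    if not text.startswith("<!", start):
        raise ValueError("not a markup declaration")
    i = start + 2
    comments, stray = [], []
    outside_start = i  # where the current run of non-comment text began
    while i < len(text):
        if text.startswith("--", i):
            # "--" outside a comment opens one; note any text we skipped over.
            gap = text[outside_start:i].strip()
            if gap:
                stray.append(gap)
            close = text.find("--", i + 2)  # the next "--" closes the comment
            if close == -1:
                raise ValueError("unterminated comment")
            comments.append(text[i + 2:close])
            i = close + 2
            outside_start = i
        elif text[i] == ">":
            # ">" outside any comment ends the whole declaration.
            gap = text[outside_start:i].strip()
            if gap:
                stray.append(gap)
            return comments, stray, i + 1
        else:
            i += 1
    raise ValueError("unterminated comment declaration")


sample = ("<!-- this is a comment -- but this is outside the comment\n"
          "--> and this\nis another comment\n--> trailing text")
comments, stray, end = scan_comment_declaration(sample)
print(comments)      # two comment bodies; the second begins with ">"
print(stray)         # the text that sat between the two comments
print(sample[end:])  # what follows the declaration
```

Note that the first "-->" does not end anything here: its "--" opens the
second comment and its ">" is ordinary comment text.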

  No surprise that no Web browser ever used a real SGML parser and

Hmmm, emacs, qweb, something-on-Mac.  Not AFAIK a long list.

that as a result we ended up with that terrible mess that is "Web HTML".

Dealing with "Web HTML" is precisely where libxml's HTMLparser is useful.

  I'm really surprised though that the HTML Working Group kept the full
support for minimization in HTML, maybe SGML had no intermediate setting
to limit it just to optional end tags, I dunno, that's just scary...

Some of us have raised that issue with them, but their only answer begins
with an X.  The supposed reason for SHORTTAG is to allow attribute
minimisation (things like <option selected> for <option selected=selected>),
but it's perfectly possible to separate that from the more troublesome
minimisations.  My practical solution is described at
<URL:http://valet.webthing.com/page/parsemode.html>.
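
As a small illustration of attribute minimisation on its own, the sketch
below uses Python's standard html.parser (a lenient HTML tokenizer, not an
SGML parser) to expand minimised attributes back to their full form.  The
class name AttributeExpander and the helper expand_minimised are
hypothetical names for this example only; this is not how the page above
or libxml does it.

```python
from html.parser import HTMLParser


class AttributeExpander(HTMLParser):
    """Collect start tags, expanding minimised attributes to name=name."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports a valueless attribute as (name, None);
        # expand it to (name, name), e.g. selected -> selected=selected.
        expanded = [(name, name if value is None else value)
                    for name, value in attrs]
        self.tags.append((tag, expanded))


def expand_minimised(html):
    """Return a list of (tag, attributes) with minimised attributes expanded."""
    parser = AttributeExpander()
    parser.feed(html)
    parser.close()
    return parser.tags


print(expand_minimised("<option selected>"))
print(expand_minimised('<option selected=selected value="x">'))
```

Both inputs yield a "selected" attribute whose value is "selected", which
is the one kind of minimisation that is harmless to accept.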

-- 
Nick Kew

In urgent need of paying work - see http://www.webthing.com/~nick/cv.html




