Re: [xml] HTMLparser comment parsing bug and patch

On Wed, Jul 30, 2003 at 07:28:08PM +0100, Nick Kew wrote:
So I still can't see a legitimate use for "incomment".

  Well, those are the semantics of the function; maybe in a more global
context it's not used, but that is how it's defined.
  Anyway William Brake came up with another patch which seems correct
and he was able to reproduce the problem and fix it.
  I will have it committed soon.

(BTW, technically there are deeper errors in the HTML comment parsing,
but since they correspond reasonably well to 'normal' browser behaviour
I wouldn't suggest worrying too much about them).

  If you know about errors, report them. Saying "there are errors but
I won't dare tell you" just sounds improper for such a project, sorry!

What I meant is that it's parsing based on XML comment syntax, which
is not the same as SGML.  Technically in HTML,

<p>In this (valid) HTML paragraph,
<!-- this is a comment -- but this is outside the comment

  so -- is interpreted as a comment ending tag in this context
--> and this
is another comment

  so --> is interpreted as a comment start ???

--> and the second "-->" ends the comment declaration.

  Hum, I can't make any sense of this, nesting is understandable 
but where is the opening comment ?
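
For readers following the exchange, the difference between the two readings
can be sketched in a few lines of Python. This is only an illustration, not
code from libxml2: the function names are hypothetical, and the SGML-style
scanner deliberately accepts any text between comments, as the example above
does, whereas a strict SGML parser would allow only white space there. In the
SGML reading, every "--" after "<!" toggles comment state and the declaration
ends at the first ">" seen outside a comment; in the XML reading, the comment
simply ends at the first "-->".

```python
def sgml_comment_decl_end(text, start=0):
    """Index just past the '>' that closes an SGML-style comment declaration.

    Hypothetical helper: each '--' toggles between inside/outside a comment,
    and only a '>' found outside a comment ends the declaration.
    Returns -1 if text[start:] does not begin with '<!--' or never closes.
    """
    if not text.startswith("<!--", start):
        return -1
    i = start + 2          # positioned on the first '--'
    in_comment = False
    while i < len(text):
        if text.startswith("--", i):
            in_comment = not in_comment
            i += 2
        elif text[i] == ">" and not in_comment:
            return i + 1
        else:
            i += 1
    return -1


def xml_comment_end(text, start=0):
    """Index just past the first '-->' (XML-style reading), or -1."""
    if not text.startswith("<!--", start):
        return -1
    end = text.find("-->", start + 4)
    return -1 if end == -1 else end + 3


sample = (
    "<!-- this is a comment -- but this is outside the comment\n"
    "--> and this\n"
    "is another comment\n"
    "--> and the rest"
)

# Text remaining after the comment under each reading.
sgml_rest = sample[sgml_comment_decl_end(sample):]
xml_rest = sample[xml_comment_end(sample):]
print(repr(sgml_rest))
print(repr(xml_rest))
```

Under the SGML-style reading the whole sample up to the second "-->" is one
declaration, so only " and the rest" remains; under the XML-style reading the
comment stops at the first "-->" and everything after it is treated as text.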

When I say I wouldn't suggest worrying too much about them, I had in mind
also other HTML technicalities, such as SHORTTAGS and NET-enabling start
tags, as demonstrated by this (valid) HTML 4.01 document:

      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
        <title/HTML example/
        <p<p/This is the second paragraph in this document.
        The first was empty.

When I want to parse this formally correctly I use OpenSP, and I thought
it might be considered out of proportion for HTMLparser to deal rigorously
with the finer points of SGML.

  No surprise that no Web browser ever used a real SGML parser and
that as a result we ended up with that terrible mess that is "Web HTML".
Better to bury those stinky remains and try to get onto something more
sound. You're right, I won't be able to make sense of this, and not
many web tools will handle it any better; let's not try to get there.
  I'm really surprised, though, that the HTML Working Group kept full
support for minimization in HTML; maybe SGML had no intermediate setting
to limit it to just optional end tags, I dunno, that's just scary...

   thanks for the report,


Daniel Veillard      | Red Hat Network
veillard redhat com  | libxml GNOME XML XSLT toolkit | Rpmfind RPM search engine
