Re: [xml] HTMLparser comment parsing bug and patch

From: Daniel Veillard <veillard redhat com>
To: Nick Kew <nick webthing com>
Cc: "William M. Brack" <wbrack mmm com hk>, xml gnome org
Subject: Re: [xml] HTMLparser comment parsing bug and patch
Date: Wed, 30 Jul 2003 11:28:57 -0400

On Tue, Jul 29, 2003 at 11:48:54PM +0100, Nick Kew wrote:

This code is used when doing progressive parsing.


I'm seeing it in the pushParser.


  right, same thing, different way to name it :-)


example

looking for "</", with the current chunk being
  "<a> start <!-- </a> --> not finished "


OK, I can see that.

What I don't see is why it should look inside the comment there:

<a>                   -> startElement event
 start                        -> characters event terminated by <
<!-- </a> -->         -> comment event

must return false. If you remove the associated code as your patch
suggests, it will return true, which is wrong w.r.t. the function's semantics.


You mean it has to look for </a>?  That doesn't make sense to me in
a SAX parser - and AFAICS the logic is the same for DOM except that
"event" is replaced by "node" in the above analysis.


  It was an example to demonstrate the semantics of the function. If the
sequence searched for is embedded in a comment, then the function must not
return a positive result until it finds the sequence outside of a comment.
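
To make that rule concrete, here is a minimal sketch in C. It is not the
actual htmlParseLookupSequence code from HTMLparser.c: the function name
lookup_outside_comments and its signature are hypothetical, and comment
handling is simplified to plain "<!--" ... "-->" pairs. It only illustrates
the intended behaviour: a match found inside a comment is ignored.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, NOT libxml2's htmlParseLookupSequence.
 * Returns the offset of the first occurrence of the two-character
 * sequence (first, next) that lies outside an HTML comment, or -1
 * if there is no such occurrence in buf[0..len).
 */
static long lookup_outside_comments(const char *buf, size_t len,
                                    char first, char next)
{
    int in_comment = 0;
    size_t i = 0;

    while (i + 1 < len) {
        if (in_comment) {
            /* Inside a comment: skip everything up to the closing "-->". */
            if (i + 2 < len && memcmp(buf + i, "-->", 3) == 0) {
                in_comment = 0;
                i += 3;
            } else {
                i++;
            }
            continue;
        }
        /* Outside a comment: "<!--" opens one. */
        if (i + 3 < len && memcmp(buf + i, "<!--", 4) == 0) {
            in_comment = 1;
            i += 4;
            continue;
        }
        if (buf[i] == first && buf[i + 1] == next)
            return (long)i;
        i++;
    }
    return -1; /* not found outside a comment in this chunk */
}
```

With the chunk from the example, "<a> start <!-- </a> --> not finished ",
a lookup for '<' followed by '/' returns -1, because the only "</" sits
inside the comment. If a real "</a>" follows the comment, its offset is
returned instead.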

The sample I diagnosed this on is the MySQL manual (selected because it
was an example of a very large HTML file I had to hand).  The TOC
shows the same behaviour:

  <HTML>
  <HEAD>
  <!-- This HTML file has been created by texi2html 1.52 (hacked by
david detron se)
     from manual.texi on 15 March 2003 -->

  <TITLE>MySQL Reference Manual for version 4.0.12. - Table of Contents</TITLE>
  ......


The <HTML> and <HEAD> events go through correctly, but the comment
then blocks all further parsing (several megabytes).


  There might be a bug in the code, but I'm defending the fact that
a sequence embedded in a comment must not influence the analysis of the
markup.

The bug may in fact be that it shouldn't be calling
htmlParseLookupSequence to find the end of a comment in the first place.


  This might be the bug, which is surprising because I would then expect it
to break on all comments embedded in HTML content, and as far as I
can tell the regression tests check that and show no difference in behaviour
between normal parsing and progressive parsing, which seems
to imply this is actually working.

(BTW, technically there are deeper errors in the HTML comment parsing,
but since they correspond reasonably well to 'normal' browser behaviour
I wouldn't suggest worrying too much about them).


  If you know about errors, report them. Saying "there are errors but
I won't dare tell you" sounds just improper for such a project, sorry!

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
