Re: [xml] HTMLparser comment parsing bug and patch



On Tue, 29 Jul 2003, Daniel Veillard wrote:

On Tue, Jul 29, 2003 at 10:47:37PM +0100, Nick Kew wrote:
No, it doesn't fix the problem.  Your patch now sets "incomment"
until it reaches the end of the comment being parsed - which means
it's gone past the sequence it's looking for.  So it has exactly
the same problem, made worse by the fact that it's doing more parsing.

Can you explain what the purpose of the "incomment" stuff is?
Under what circumstances does it want to look past a comment
for a token?

Please do not suggest a patch if you don't understand the modified
code !

I said

This code is used when doing progressive parsing.

I'm seeing it in the pushParser.

       The progressive parser
needs to get to the end of a sequence. The modified function is there to
check that the current chunk contains a full sequence.  The parameter
indicates the sequence of characters to look for to detect the
end, which allows the chunk to be handed off to the parser. If the
sequence is embedded in a comment it must not be considered as being
present; it doesn't exist from a markup point of view.

Yep, fine so far.


example

looking for "</", with the current chunk being
  "<a> start <!-- </a> --> not finished "

OK, I can see that.

What I don't see is why it should look inside the comment there:

<a>                     -> startElement event
 start                  -> characters event terminated by <
<!-- </a> -->           -> comment event

must return false. If you remove the associated code as your patch
suggests, it will return true, which is wrong w.r.t. the function's semantics.

You mean it has to look for </a>?  That doesn't make sense to me in
a SAX parser - and AFAICS the logic is the same for DOM except that
"event" is replaced by "node" in the above analysis.
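
To make the lookup semantics under discussion concrete, here is a minimal
sketch in C of a comment-aware search. It is not libxml2's
htmlParseLookupSequence: the name lookup_outside_comments and its signature
are assumptions made for illustration, and comment syntax is simplified to a
plain "<!--" ... "-->" pair.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not the libxml2 function).
 * Returns the offset of the first occurrence of seq in buf[0..len)
 * that lies outside an HTML comment, or -1 if there is none.
 * A comment still open at the end of the chunk hides everything
 * after its "<!--".
 */
static long lookup_outside_comments(const char *buf, size_t len,
                                    const char *seq)
{
    size_t seqlen = strlen(seq);
    size_t i = 0;
    int incomment = 0;

    while (i < len) {
        if (incomment) {
            /* Inside a comment: only the terminator is significant. */
            if (len - i >= 3 && memcmp(buf + i, "-->", 3) == 0) {
                incomment = 0;
                i += 3;
            } else {
                i++;
            }
            continue;
        }
        if (len - i >= 4 && memcmp(buf + i, "<!--", 4) == 0) {
            /* Comment opener: skip it and ignore matches until "-->". */
            incomment = 1;
            i += 4;
            continue;
        }
        if (seqlen > 0 && len - i >= seqlen &&
            memcmp(buf + i, seq, seqlen) == 0)
            return (long)i;
        i++;
    }
    return -1;
}
```

With the chunk from the example, "<a> start <!-- </a> --> not finished ",
a search for "</" returns -1 because the only match sits inside the comment.
Once more data supplies a real "</a>" after the comment, the search returns
its offset.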


  Your patch is wrong. It is possible that the incoming HTML is just too
broken, but it's impossible to tell without getting it. Please provide it
as the bug report guidelines ask.

The sample I diagnosed this on is the MySQL manual (selected because it
was an example of a very large HTML file I had to hand).  The TOC
shows the same behaviour:

  <HTML>
  <HEAD>
  <!-- This HTML file has been created by texi2html 1.52 (hacked by
david detron se)
     from manual.texi on 15 March 2003 -->

  <TITLE>MySQL Reference Manual for version 4.0.12. - Table of Contents</TITLE>
  ......


The <HTML> and <HEAD> events go through correctly, but the comment
then blocks all further parsing (several megabytes).

The bug may in fact be that it shouldn't be calling
htmlParseLookupSequence to find the end of a comment in the first place.
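
As a rough illustration of that alternative, the sketch below checks whether
a chunk that starts at a comment already holds the whole comment, using a
plain search for the terminator and no "incomment" tracking. The helper name
complete_comment_length is hypothetical, this is not how libxml2 does it, and
the comment syntax is again simplified to "<!--" ... "-->".

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (an assumption for illustration only).
 * If buf[0..len) starts with "<!--" and contains the closing "-->",
 * returns the length of the complete comment, terminator included.
 * Returns 0 if buf does not start with a comment or the comment is
 * not finished yet, i.e. the caller should wait for more data.
 */
static size_t complete_comment_length(const char *buf, size_t len)
{
    size_t i;

    /* "<!---->" is the shortest complete comment: 7 bytes. */
    if (len < 7 || memcmp(buf, "<!--", 4) != 0)
        return 0;

    /* Plain scan for the terminator, starting after the opener. */
    for (i = 4; i + 3 <= len; i++) {
        if (memcmp(buf + i, "-->", 3) == 0)
            return i + 3;
    }
    return 0;
}
```

A push-style caller would hand the first complete_comment_length() bytes to
the comment handler when the result is non-zero, and otherwise keep buffering.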

(BTW, technically there are deeper errors in the HTML comment parsing,
but since they correspond reasonably well to 'normal' browser behaviour
I wouldn't suggest worrying too much about them).

-- 
Nick Kew

In urgent need of paying work - see http://www.webthing.com/~nick/cv.html




