Re: [xml] HTMLparser comment parsing bug and patch



On Wed, 30 Jul 2003, Daniel Veillard wrote:

example

looking for "</" current chunk being
  "<a> start <!-- </a> --> not finished "

OK, I can see that.

What I don't see is why it should look inside the comment there:

<a>                 -> startElement event
 start                      -> characters event terminated by <
<!-- </a> -->               -> comment event

must return false. If you remove the associated code as your patch
suggest it will return true which is wrong w.r.t. the function semantic.

You mean it has to look for </a>?  That doesn't make sense to me in
a SAX parser - and AFAICS the logic is the same for DOM except that
"event" is replaced by "node" in the above analysis.

  it was an example to demonstrate the semantic of the function. If the
sequence seached for is embedded into a comment then the function must not
return with a positive return code until it finds it outside of a comment.

But why should it look beyond the start of the comment?  The point here
is that a comment must be a new event (or node), no matter where it
arises in the markup.  Your example doesn't work because a startElement
can be followed by other startElements, characters, etc, and the parser
looks for "<", not "</".

This is where the bug arises: it finds "<" and emits a characters event
for what came before.  Next call to htmlParseLookupSequence is looking
for "-->".  But it gets flagged as "incomment" and never sees the
end-of-comment it's looking for.

AFAICS the only purpose for these semantics would be to try error-
correcting <a <!-- this is supposed to be inside the start tag --> >.
Since that fragment is valid (and normalises to <a><!-- this is ... --> >
 - the final ">" is character data), such error-correction would be wrong.
So I still can't see a legitimate use for "incomment".


  This might be the bug, surprizing because I would then expect it
to break on all comments embedded in HTML content, and as far as I
can tell the regression tests check that and show no behaviour difference
between the normal parsing and the progressive parsing, which seems
to imply this is actually working.

There may be no observed behaviour difference.  But I was merely debugging
a problem reported by my correspondent, who reported it taking 28 minutes
to parse an 8Mb HTML file.  My patch reduced his parse time to 9 seconds.
I had not observed this in the past because I haven't run tests on
megabyte-size files containing comments.

(BTW, technically there are deeper errors in the HTML comment parsing,
but since they correspond reasonably well to 'normal' browser behaviour
I wouldn't suggest worrying too much about them).

  If you know about errors report them. Saying "there is errors but
I won't dare telling you" sounds just improper for such a project, sorry !

What I meant is that it's parsing based on XML comment syntax, which
is not the same as SGML.  Technically in HTML,

<p>In this (valid) HTML paragraph,
<!-- this is a comment -- but this is outside the comment --> and this
is another comment --> and the second "-->" ends the comment declaration.
</p>

When I say I wouldn't suggest worrying too much about them, I had in mind
also other HTML technicalities, such as SHORTTAGS and NET-enabling start
tags, as demonstrated by this (valid) HTML 4.01 document:

        <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
        <title/HTML example/
        <p<p/This is the second paragraph in this document.
        The first was empty.

When I want to parse this formally correctly I use OpenSP, and I thought
it might be considered out of proportion for HTMLparser to deal rigorously
with the finer points of SGML.

-- 
Nick Kew

In urgent need of paying work - see http://www.webthing.com/~nick/cv.html






[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]