Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border



On Wed, Jun 21, 2006 at 04:29:56PM +0200, Cyrill Osterwalder wrote:
Hi all

After some more research I believe I have found the reason for the
problem with the CDATA parsing. When PARSE_HTML_RECOVER is set, the
following check in htmlParseTryOrFinish() is not sufficient before
calling htmlParseScript():

/*
 * Handle SCRIPT/STYLE separately
 */
if ((!terminate) &&
    (htmlParseLookupSequence(ctxt, '<', '/', 0, 0) < 0))
        goto done;
htmlParseScript(ctxt);


This code makes sure that there is an end tag starting somewhere in the
buffer that is going to be processed by htmlParseScript(). However, in
recovery mode, htmlParseScript() will consume the "</" characters if the
real CDATA end tag is not fully inside the current chunk (as described
in the problem report).

  True. I was thinking about something like that. This is all due to
script and style having different parsing constraints.
  Why do you use PARSE_HTML_RECOVER? The parser already does recovery
to some extent without it (I mean the HTML parser :-).

I don't have a patch recommendation for the moment but I see two
possibilities:

a) htmlParseTryOrFinish() could guarantee that the buffer contains the
desired close tag (or terminate is true). I guess that this could be
done using multiple htmlParseLookupSequence() calls and checking for the
tag name in a loop...?

  Hum, well we could check for the current element and make 2 specific
tests in that case. This would be very hard anyway: people are going to
come with '</ style' or '</foo>' and expect that to close the open tag,
and with 'style "</" style' and expect it not to close it...
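
  To make possibility a) concrete, here is a minimal standalone sketch of
the kind of check being discussed. It is not libxml2 code: the helper name
has_complete_end_tag() is hypothetical, and it assumes the end tag has the
strict form "</name", optional whitespace, ">". It only shows that finding
"</" alone is not enough and that the name has to be compared as well.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libxml2.
 * Returns 1 if buf[0..len) contains a complete end tag for `name`
 * ("</name", optional whitespace, ">"), compared case-insensitively.
 * Returns 0 otherwise, including when the tag is cut by the chunk border.
 */
static int has_complete_end_tag(const char *buf, size_t len, const char *name)
{
    size_t nlen, i, j;

    assert(buf != NULL && name != NULL);
    nlen = strlen(name);

    /* need room for "</", the name, and at least one more byte */
    for (i = 0; i + 2 + nlen < len; i++) {
        if (buf[i] != '<' || buf[i + 1] != '/')
            continue;

        /* case-insensitive comparison of the tag name */
        for (j = 0; j < nlen; j++) {
            if (tolower((unsigned char)buf[i + 2 + j]) !=
                tolower((unsigned char)name[j]))
                break;
        }
        if (j < nlen)
            continue;           /* some other "</..." sequence, keep looking */

        /* skip whitespace between the name and '>' */
        j = i + 2 + nlen;
        while (j < len && isspace((unsigned char)buf[j]))
            j++;
        if (j < len && buf[j] == '>')
            return 1;
    }
    return 0;
}
```

  With this strict form, a chunk ending in "</scr" is reported as incomplete
even if an earlier "</b>" sits inside a script string, but '</ style' is not
accepted as a close tag, which is exactly the kind of input that makes a
general rule hard to pick.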
  
b) htmlParseScript would have to be more powerful in order to recognize
that it is trying to do xmlStrncasecmp() on an incomplete tag string. In
that case it should break and be called again by htmlParseTryOrFinish().
That on the other hand would have to be more careful with the switch to
the end tag processing after the call to htmlParseScript().

  Not sure it's much better
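
  For comparison, possibility b) amounts to a three-way result from the
name comparison instead of a yes/no answer. The sketch below is standalone
and hypothetical (neither the enum nor match_end_tag_name() exists in
libxml2); it assumes the caller has already seen "</" and passes the bytes
that remain in the current chunk.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes, not libxml2 API. */
enum tag_match { TAG_NO_MATCH = 0, TAG_MATCH = 1, TAG_NEED_MORE = 2 };

/*
 * p points just past "</"; avail is the number of bytes left in the chunk.
 * Compares case-insensitively against `name` and reports TAG_NEED_MORE
 * when the chunk ends before the whole name could be compared.
 */
static enum tag_match match_end_tag_name(const char *p, size_t avail,
                                         const char *name)
{
    size_t nlen, n, i;

    assert(p != NULL && name != NULL);
    nlen = strlen(name);
    n = (avail < nlen) ? avail : nlen;

    for (i = 0; i < n; i++) {
        if (tolower((unsigned char)p[i]) != tolower((unsigned char)name[i]))
            return TAG_NO_MATCH;
    }
    /* every available byte matched; incomplete if the chunk ended first */
    return (avail < nlen) ? TAG_NEED_MORE : TAG_MATCH;
}
```

  On TAG_NEED_MORE the script parsing would have to stop without consuming
the "</" and wait for the next chunk, which is where the extra care in the
caller comes from.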

Possibility a) looks better to me and I might try to implement an
example patch.

  You can try, but it's all very messy IMHO, I will take patches if not
obviously broken (could be a good idea to provide examples for the test
suite too).

   thanks

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


