Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border
- From: Daniel Veillard <veillard redhat com>
- To: Cyrill Osterwalder <Cyrill Osterwalder visonys com>
- Cc: xml gnome org
- Subject: Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border
- Date: Wed, 21 Jun 2006 10:55:36 -0400
On Wed, Jun 21, 2006 at 04:29:56PM +0200, Cyrill Osterwalder wrote:
> Hi all
>
> After some more research I believe to have found the reason for the
> problem with the CDATA parsing. In case PARSE_HTML_RECOVER is true, the
> following criteria in htmlParseTryOrFinish() is not enough for calling
> htmlParseScript():
>
> /*
> * Handle SCRIPT/STYLE separately
> */
> if ((!terminate) &&
> (htmlParseLookupSequence(ctxt, '<', '/', 0, 0) < 0))
> goto done;
> htmlParseScript(ctxt);
>
>
> This code makes sure that there is an end tag starting somewhere in the
> buffer that is going to be processed by htmlParseScript(). However, in
> recovery mode, htmlParseScript() will consume the "</" characters if the
> real CDATA end tag is not fully inside the current chunk (like described
> in the problem report).
True. I was thinking about something like that. This is all due to
script and style having different parsing constraints.
Why do you use PARSE_HTML_RECOVER? The parser is already doing recovery
to some extent without it (I mean the HTML parser :-).
> I don't have a patch recommendation for the moment but I see two
> possibilities:
>
> a) htmlParseTryOrFinish() could guarantee that the buffer contains the
> desired close tag (or terminate is true). I guess that this could be
> done using multiple htmlParseLookupSequence() calls and checking for the
> tag name in a loop...?
Hum, well we could check for the current element and make 2 specific
tests in that case. This would be very hard anyway; people are gonna come
with '</ style' or '</foo>' and expect that to close the open tag, and
'style "</" style' and expect that to not close it...
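[Editorial sketch of possibility a), not code from libxml2. The helper name
has_complete_close_tag() and its signature are hypothetical. It assumes the
strict form "</name", optional whitespace, ">" and compares the name
case-insensitively. It does not handle the '</ style' or quoted "</" cases
mentioned above, which is exactly where this approach gets messy.]

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a libxml2 function).
 * Returns 1 if buf[0..len) contains a complete close tag for `name`:
 * "</" + name (case-insensitive) + optional whitespace + ">".
 * Returns 0 if no such tag is fully inside the buffer.
 */
int has_complete_close_tag(const char *buf, size_t len, const char *name)
{
    size_t i, j, k;

    assert(buf != NULL && name != NULL);

    for (i = 0; i + 1 < len; i++) {
        if (buf[i] != '<' || buf[i + 1] != '/')
            continue;

        /* Compare the tag name that follows "</". */
        j = i + 2;
        k = 0;
        while (name[k] != '\0' && j < len &&
               tolower((unsigned char)buf[j]) ==
               tolower((unsigned char)name[k])) {
            j++;
            k++;
        }
        if (name[k] != '\0')
            continue; /* other tag, or name cut off by the chunk border */

        /* Allow whitespace before the closing '>'. */
        while (j < len && isspace((unsigned char)buf[j]))
            j++;
        if (j < len && buf[j] == '>')
            return 1;
    }
    return 0;
}
```

[With such a check, the caller would only hand the buffer to the script
parser once the function returns 1 or terminate is set. A chunk ending in
"</scr" makes it return 0, so the parser would wait for more data.]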
> b) htmlParseScript would have to be more powerful in order to recognize
> that it is trying to do xmlStrncasecmp() on an incomplete tag string. In
> that case it should break and be called again by htmlParseTryOrFinish().
> That on the other hand would have to be more careful with the switch to
> the end tag processing after the call to htmlParseScript().
Not sure it's much better
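[Editorial sketch of possibility b), again not libxml2 code. The function
script_text_len() and its need_more flag are hypothetical. The idea is that
the script scanner stops before a "<" whose comparison against "</name" runs
off the end of the chunk, instead of consuming it as text.]

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a libxml2 function).
 * Returns how many leading bytes of buf[0..len) can safely be treated as
 * script text. Scanning stops at the first '<' that starts "</name"
 * (case-insensitive) or that could still become "</name" once more data
 * arrives. In the second case *need_more is set to 1.
 */
size_t script_text_len(const char *buf, size_t len, const char *name,
                       int *need_more)
{
    size_t i, j, k;

    assert(buf != NULL && name != NULL && need_more != NULL);
    *need_more = 0;

    for (i = 0; i < len; i++) {
        if (buf[i] != '<')
            continue;

        j = i + 1;
        if (j == len) {          /* lone '<' at the chunk border */
            *need_more = 1;
            return i;
        }
        if (buf[j] != '/')
            continue;            /* '<' used as an operator, plain text */

        j++;
        k = 0;
        while (name[k] != '\0' && j < len &&
               tolower((unsigned char)buf[j]) ==
               tolower((unsigned char)name[k])) {
            j++;
            k++;
        }
        if (name[k] == '\0')
            return i;            /* "</name" is present: end of script text */
        if (j == len) {          /* comparison ran off the chunk border */
            *need_more = 1;
            return i;
        }
        /* Mismatch inside the buffer: some other "</", keep scanning. */
    }
    return len;
}
```

[For the chunk "var a=1;</scr" and name "script" this returns 8 with
need_more set, so the "</scr" tail stays in the buffer for the next call.]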
> Possibility a) looks better to me and might try to implement a patch
> example.
You can try, but it's all very messy IMHO, I will take patches if not
obviously broken (could be a good idea to provide examples for the test
suite too).
thanks
Daniel
--
Daniel Veillard | Red Hat http://redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/