Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border



Hi all

After some more research I believe I have found the reason for the
problem with the CDATA parsing. When HTML_PARSE_RECOVER is set, the
following check in htmlParseTryOrFinish() is not sufficient before
calling htmlParseScript():

/*
 * Handle SCRIPT/STYLE separately
 */
if ((!terminate) &&
    (htmlParseLookupSequence(ctxt, '<', '/', 0, 0) < 0))
        goto done;
htmlParseScript(ctxt);


This code only makes sure that a "</" sequence occurs somewhere in the
buffer that is going to be processed by htmlParseScript(). However, in
recovery mode, htmlParseScript() will consume the "</" characters as
content if the real CDATA end tag is not fully inside the current chunk
(as described in the problem report).
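
To make the failure concrete, here is a small standalone model of that
check. It is not libxml2 code: find_close_start() is a hypothetical
stand-in for the htmlParseLookupSequence(ctxt, '<', '/', 0, 0) test and
only reports where "</" first occurs in a chunk.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the "</" lookup (not a libxml2 function):
 * returns the offset of the first "</" in buf[0..len), or -1. */
long find_close_start(const char *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        if (buf[i] == '<' && buf[i + 1] == '/')
            return (long)i;
    }
    return -1;
}
```

For a chunk that ends in the middle of the end tag, such as
"var a = 1; </scr", this lookup succeeds, so the script parser would be
called even though the name after "</" is still incomplete.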

I don't have a patch recommendation for the moment but I see two
possibilities:

a) htmlParseTryOrFinish() could guarantee that the buffer contains the
desired close tag (or terminate is true). I guess that this could be
done using multiple htmlParseLookupSequence() calls and checking for the
tag name in a loop...?
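
A rough sketch of what a) could look like, written as standalone C
rather than against the real parser context. The names same_name() and
find_close_tag() are made up for illustration, and the real code would
have to work on the parser's input buffer instead of a plain char array.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of n bytes; returns 1 on match. */
int same_name(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    }
    return 1;
}

/* Hypothetical helper: returns the offset of the first "</name" that
 * lies completely inside buf[0..len), or -1 if there is none yet. */
long find_close_tag(const char *buf, size_t len, const char *name)
{
    size_t nlen = strlen(name);

    /* Only consider positions where "</" plus the whole name fits. */
    for (size_t i = 0; i + 2 + nlen <= len; i++) {
        if (buf[i] == '<' && buf[i + 1] == '/' &&
            same_name(buf + i + 2, name, nlen))
            return (long)i;
    }
    return -1;
}
```

With something like this, the caller would keep waiting for more data
while the result is -1 and terminate is false, instead of handing a
truncated "</scr" to the script parser.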

b) htmlParseScript() would have to be smarter and recognize that it is
about to call xmlStrncasecmp() on an incomplete tag string. In that case
it should stop and wait to be called again by htmlParseTryOrFinish().
htmlParseTryOrFinish(), on the other hand, would then have to be more
careful about switching to end tag processing after the call to
htmlParseScript().
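
For b), the comparison would need a third outcome besides "match" and
"no match". A minimal standalone sketch of that idea follows; the enum
and check_close_name() are hypothetical names, and cur is assumed to
point just behind the "</" with avail bytes left in the chunk.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical three-way result for the end tag name check. */
enum close_check { CLOSE_MATCH, CLOSE_MISMATCH, CLOSE_NEED_MORE };

/* Compare the bytes behind "</" against the open element name without
 * reading past the end of the available data. */
enum close_check check_close_name(const char *cur, size_t avail,
                                  const char *name)
{
    size_t nlen = strlen(name);
    size_t n = (avail < nlen) ? avail : nlen;

    for (size_t i = 0; i < n; i++) {
        if (tolower((unsigned char)cur[i]) !=
            tolower((unsigned char)name[i]))
            return CLOSE_MISMATCH;   /* definitely some other tag */
    }
    /* Everything seen so far matches; the name may just be cut off. */
    return (avail < nlen) ? CLOSE_NEED_MORE : CLOSE_MATCH;
}
```

On CLOSE_NEED_MORE the script parser would leave the "</" unconsumed and
return, unless terminate is true, in which case the remainder can only
be treated as content.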

Possibility a) looks better to me, and I might try to implement an
example patch.

Cyrill


