Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border
- From: "Cyrill Osterwalder" <Cyrill Osterwalder visonys com>
- To: <veillard redhat com>, <xml gnome org>
- Cc: Cyrill Osterwalder <Cyrill Osterwalder visonys com>
- Subject: Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border
- Date: Thu, 22 Jun 2006 10:57:02 +0200
I think I have found the reason why chunked CDATA parsing also fails
without the special recovery mode.
If the chunk actually ends with "</", htmlParseTryOrFinish() calls
htmlParseScript() to process it. There, the normal break condition is
coded as follows:
    if ((cur == '<') && (NXT(1) == '/')) {
        if (((NXT(2) >= 'A') && (NXT(2) <= 'Z')) ||
            ((NXT(2) >= 'a') && (NXT(2) <= 'z')))
        {
            break; /* while */
        }
    }
However, NXT(2) is not guaranteed to be available when the chunk ends
right after "</". In that case the loop does not break but consumes the
"</", which breaks CDATA parsing in all cases, even without
PARSE_HTML_RECOVER being set. This could be solved by not calling
htmlParseScript() on a chunk that ends with "</".
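
To make the failure mode concrete, here is a small standalone sketch. It
is not libxml2 code: the names scan_script_chunk, scan_result and
is_ascii_letter are hypothetical, and it works on a plain buffer with an
explicit length instead of the parser's input macros. It only illustrates
how a bounds-checked look-ahead could report "need more data" rather than
consuming a trailing "</".

```c
#include <stddef.h>

/* Hypothetical result type for this sketch; not part of libxml2. */
typedef enum {
    SCAN_NO_END_TAG,  /* no "</" + letter in the chunk: all of it is CDATA */
    SCAN_END_TAG,     /* "</" followed by an ASCII letter starts at *pos */
    SCAN_NEED_MORE    /* chunk ends in "<" or "</": undecidable until more data arrives */
} scan_result;

/* Same letter test as the quoted break condition (A-Z or a-z). */
static int is_ascii_letter(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/*
 * Scan buf[0..len) for the start of an end tag.
 * Every look-ahead is checked against len, so a chunk that stops
 * right after "<" or "</" yields SCAN_NEED_MORE instead of being
 * treated as ordinary CDATA.
 * On return, *pos is the number of bytes that are safe to consume.
 */
static scan_result scan_script_chunk(const char *buf, size_t len, size_t *pos)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != '<')
            continue;
        if (i + 1 >= len) {           /* '<' is the last byte */
            *pos = i;
            return SCAN_NEED_MORE;
        }
        if (buf[i + 1] != '/')
            continue;
        if (i + 2 >= len) {           /* "</" are the last two bytes */
            *pos = i;
            return SCAN_NEED_MORE;
        }
        if (is_ascii_letter(buf[i + 2])) {
            *pos = i;
            return SCAN_END_TAG;
        }
    }
    *pos = len;
    return SCAN_NO_END_TAG;
}
```

With this shape, a caller would consume only the first *pos bytes on
SCAN_NEED_MORE and keep the rest buffered until the next chunk arrives.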
The case with the CDATA recovery option is even more complicated.
What do you think about checking in htmlParseTryOrFinish() that the
last 8 characters of the chunk do not include "</" before calling
htmlParseScript(), in order to solve both cases? Assuming we are in a
CDATA block that is followed by at least one real end tag and further
tags afterwards, this should be safe, shouldn't it?
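
As a rough illustration of that check, a helper along the following lines
could be called before handing the data to htmlParseScript(). This is
only a sketch under assumptions: the helper name and the TAIL_WINDOW
constant are hypothetical, and it takes a plain buffer and length rather
than the parser context.

```c
#include <stddef.h>

/* Window size taken from the proposal above (last 8 characters). */
#define TAIL_WINDOW 8

/*
 * Hypothetical helper, not part of libxml2.
 * Returns 1 if "</" occurs within the last TAIL_WINDOW bytes of
 * buf[0..len), 0 otherwise. A caller could postpone script parsing
 * while this returns 1 and wait for the next chunk.
 */
static int chunk_tail_has_end_tag_start(const char *buf, size_t len)
{
    size_t start = (len > TAIL_WINDOW) ? len - TAIL_WINDOW : 0;

    /* i + 1 < len keeps the second byte of the pair inside the buffer. */
    for (size_t i = start; i + 1 < len; i++) {
        if (buf[i] == '<' && buf[i + 1] == '/')
            return 1;
    }
    return 0;
}
```

Note that this follows the proposal literally: a chunk ending in a lone
'<' is not detected, so that case would need separate handling.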
Cyrill
PS: Please let me know if such detailed source code discussions are not
supposed to take place on the list.