Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border

From: "Cyrill Osterwalder" <Cyrill Osterwalder visonys com>
To: <veillard redhat com>, <xml gnome org>
Cc: Cyrill Osterwalder <Cyrill Osterwalder visonys com>
Subject: Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border
Date: Thu, 22 Jun 2006 10:57:02 +0200


I think I have found the reason why chunked CDATA parsing also fails
without the special recovery mode:

If the chunk actually ends with "</", then htmlParseTryOrFinish() calls
htmlParseScript() to process it. In there, the normal break condition is
coded as follows:

if ((cur == '<') && (NXT(1) == '/')) {
    if (((NXT(2) >= 'A') && (NXT(2) <= 'Z')) ||
        ((NXT(2) >= 'a') && (NXT(2) <= 'z')))
    {
        break; /* while */
    }
}

However, NXT(2) is not guaranteed to be available when the chunk ends
right after "</". In that case the letter test fails, so the loop does
not break but consumes the "</" as script content. This breaks CDATA
parsing in all cases, even without PARSE_HTML_RECOVER being set. It
could be solved by not calling htmlParseScript() on a chunk that ends
with "</".
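
To make the failure concrete, here is a minimal stand-alone sketch of the
same break condition. It is not the libxml2 code: consumed_script_bytes()
is a hypothetical stand-in for the scan loop, and it assumes the chunk is
NUL-terminated so that the lookahead past "</" reads the terminator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the script scan loop (not libxml2 code).
 * Returns how many bytes of a NUL-terminated chunk would be consumed
 * as script content before the break condition fires. */
static size_t consumed_script_bytes(const char *chunk)
{
    size_t i = 0;

    assert(chunk != NULL);
    while (chunk[i] != '\0') {
        if (chunk[i] == '<' && chunk[i + 1] == '/') {
            /* Plays the role of NXT(2); at the chunk end this is the
             * terminator, so the letter test below cannot succeed. */
            char c = chunk[i + 2];

            if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
                break; /* while */
        }
        i++;
    }
    return i;
}
```

With the complete input "var x;</script>" the loop stops in front of
"</", but with a chunk that ends in "var x;</" it runs to the end of the
buffer and the "</" is swallowed as script text.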

The case with the CDATA recovery option is even more complicated.

What do you think about checking in htmlParseTryOrFinish() that the
last 8 characters of the chunk do not include "</" before calling
htmlParseScript(), in order to solve both cases? Assuming we are in a
CDATA block that is followed by at least one real end tag and by other
tags afterwards, this should be safe, shouldn't it?
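
A rough sketch of the check I have in mind, only to illustrate the idea.
The helper name tail_has_end_tag_start() and the TAIL_WINDOW constant are
made up for this example and are not part of libxml2; how the result
would be wired into htmlParseTryOrFinish() is left open.

```c
#include <assert.h>
#include <stddef.h>

#define TAIL_WINDOW 8 /* number of trailing characters to inspect */

/* Hypothetical helper (not libxml2 API): returns 1 if "</" occurs within
 * the last TAIL_WINDOW bytes of buf[0..len), 0 otherwise. */
static int tail_has_end_tag_start(const char *buf, size_t len)
{
    size_t start = (len > TAIL_WINDOW) ? len - TAIL_WINDOW : 0;
    size_t i;

    assert(buf != NULL || len == 0);
    /* i + 1 < len keeps the two-byte comparison inside the buffer. */
    for (i = start; i + 1 < len; i++) {
        if (buf[i] == '<' && buf[i + 1] == '/')
            return 1;
    }
    return 0;
}
```

If this returns 1, the caller would wait for more data instead of calling
htmlParseScript() on the current chunk.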

Cyrill

PS: Please let me know if such detailed source code discussions are not
supposed to take place on the list.

Follow-Ups:
- Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border
  - From: Daniel Veillard
