Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border

From: Daniel Veillard <veillard redhat com>
To: Cyrill Osterwalder <Cyrill Osterwalder visonys com>
Cc: xml gnome org
Subject: Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border
Date: Thu, 22 Jun 2006 05:24:19 -0400

On Thu, Jun 22, 2006 at 10:57:02AM +0200, Cyrill Osterwalder wrote:


I suppose I found the reason why chunked CDATA parsing also fails
without the special recovery mode:

If the chunk actually ends with "</", then htmlParseTryOrFinish() calls
htmlParseScript() to process it. In there, the normal break condition is
coded as follows:

if ((cur == '<') && (NXT(1) == '/')) {
    if (((NXT(2) >= 'A') && (NXT(2) <= 'Z')) ||
        ((NXT(2) >= 'a') && (NXT(2) <= 'z')))
    {
        break; /* while */
    }
}

However, NXT(2) is not guaranteed to be available when the chunk ends
right after "</". In that case the loop does not break but consumes the
"</", which breaks CDATA parsing in all cases, even without
PARSE_HTML_RECOVER being set. This could be solved by not calling
htmlParseScript() with a chunk ending in "</".
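
To make the failure concrete, here is a minimal standalone sketch of a
scanner that uses the same break condition over a bounded buffer. It is
not the libxml2 code: peek() is a hypothetical stand-in for NXT(n) that
yields 0 when the requested byte has not arrived yet, and
naive_script_consume() is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for NXT(n): the byte n positions ahead of pos,
 * or 0 when that position lies beyond the data received so far. */
static int peek(const char *buf, size_t avail, size_t pos, size_t n)
{
    return (pos + n < avail) ? (unsigned char)buf[pos + n] : 0;
}

static int is_ascii_letter(int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Returns how many of the avail bytes in buf a scanner using the quoted
 * break condition would consume as script text. */
static size_t naive_script_consume(const char *buf, size_t avail)
{
    size_t pos = 0;

    assert(buf != NULL);
    while (pos < avail) {
        if (buf[pos] == '<' && peek(buf, avail, pos, 1) == '/' &&
            is_ascii_letter(peek(buf, avail, pos, 2)))
            break;  /* looks like the start of an end tag: stop here */
        pos++;
    }
    return pos;
}
```

With the 13-byte buffer "x=1;</script>" the sketch stops after 4 bytes,
in front of the end tag. With the 6-byte chunk "x=1;</" it consumes all
6 bytes, because the letter that would trigger the break is not in the
buffer yet.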

The case with the CDATA recovery option is even more complicated.

What do you think about checking in htmlParseTryOrFinish() that the
last 8 characters of the chunk do not include "</" before calling
htmlParseScript(), in order to solve both cases? Assuming we are in a
CDATA block that is followed by at least one real end tag and other tags
afterwards, this should be safe, shouldn't it?


  I think delaying calling the parser if "</" is present in the last 8
characters would be somewhat broken. You could perfectly well find a number
of other elements after the script/style block (actually I would expect that)
and those need to be closed.
  What should be checked is probably that there are at least 8 characters
in the buffer for consumption there (i.e. avail >= 8); that should be safe:
   - it guarantees we can test for the tag name
   - a style or script is unlikely to be at the very end of an HTML document
     (and if it is, we would have terminate set), plus it's not yet
     displayable content, so waiting for the next packet should not cause
     any degradation there.

  Can you test by changing the condition to:

                    if ((!terminate) &&
                        ((htmlParseLookupSequence(ctxt, '<', '/', 0, 0) < 0) ||
                         (avail < 8)))
                        goto done;

in that "Handle SCRIPT/STYLE separately" section and report? If the result
is positive, please provide a contextual patch :-)
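
For experimenting outside the parser, the suggested condition can be
sketched as a standalone predicate. This is an illustration under
assumptions, not the libxml2 code: find_close_open() is a hypothetical
stand-in for the htmlParseLookupSequence(ctxt, '<', '/', 0, 0) call, and
should_wait_for_more() is an illustrative name for the decision that
leads to "goto done".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the lookup call: offset of the first "</"
 * within the avail bytes of buf, or -1 if there is none. */
static long find_close_open(const char *buf, size_t avail)
{
    size_t i;

    for (i = 0; i + 1 < avail; i++)
        if (buf[i] == '<' && buf[i + 1] == '/')
            return (long)i;
    return -1;
}

/* Returns 1 when the parser should stop and wait for the next chunk,
 * 0 when it may go on and parse the script/style content. */
static int should_wait_for_more(const char *buf, size_t avail, int terminate)
{
    assert(buf != NULL);
    if (terminate)
        return 0;   /* no more data will come: parse what is there */
    return find_close_open(buf, avail) < 0 || avail < 8;
}
```

In this sketch the 6-byte chunk "x=1;</" makes the predicate return 1
unless terminate is set, while "x=1;</script>" lets parsing continue.
Note that the check counts the total bytes available, so a buffer of 8
or more bytes that still ends in "</" passes it; that case is worth
including when testing the change.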

PS: Please let me know if such detailed source code discussions are not
supposed to be done on the list


  that's fine, that's where the knowledge should be shared!

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
