Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border

From: "Cyrill Osterwalder" <Cyrill Osterwalder visonys com>
To: <veillard redhat com>, <xml gnome org>
Cc: Cyrill Osterwalder <Cyrill Osterwalder visonys com>
Subject: Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border
Date: Thu, 22 Jun 2006 08:22:36 +0200

Why do you use PARSE_HTML_RECOVER? The parser already
does recovery to some extent without it
(I mean the HTML parser :-).


Actually, the problem also seems to exist without PARSE_HTML_RECOVER;
otherwise the test with testHTML.c from the libxml2 package would not show
it, right? I will have to look at this again. I had the impression that
recovery mode was the trigger in htmlParseScript() that actually produces
the problem. But my testHTML.c example is easy to reproduce, and it
does not use PARSE_HTML_RECOVER. With the testHTML.c example it seems that
parsing fails if the CDATA end tag overlaps the chunk boundary. If
that's true even without PARSE_HTML_RECOVER, then it's just a matter of
luck whether chunked parsing of HTML with CDATA succeeds.
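
To illustrate the failure mode I mean, here is a minimal sketch in plain
C. It is not libxml2 code and does not call the parser; EndTagScanner,
scanner_feed() and find_in_chunks() are names made up for this example.
It only shows why a search that starts from scratch in every chunk misses
an end tag that straddles the chunk border, and that carrying the partial
match over to the next chunk avoids that. I assume a non-empty, lowercase
search string whose first byte does not occur again inside it, which
holds for "</script>" and "</style>".

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scanner state; not a libxml2 type. */
typedef struct {
    const char *needle;  /* lowercase end tag to look for, e.g. "</script>" */
    size_t matched;      /* needle bytes matched so far, kept across chunks */
} EndTagScanner;

/*
 * Feed one chunk; return 1 once the whole needle has been seen.
 * Restarting at needle[0] after a mismatch is only correct because the
 * first byte of the needle ('<') does not occur again inside it.
 */
static int scanner_feed(EndTagScanner *s, const char *chunk, size_t len)
{
    size_t n = strlen(s->needle);

    for (size_t i = 0; i < len; i++) {
        int c = tolower((unsigned char)chunk[i]);

        if (c == (unsigned char)s->needle[s->matched]) {
            if (++s->matched == n) {
                s->matched = 0;
                return 1;
            }
        } else {
            s->matched = (c == (unsigned char)s->needle[0]) ? 1 : 0;
        }
    }
    return 0;
}

/*
 * Scan a sequence of NUL-terminated chunks. With keep_state == 0 the
 * partial match is dropped at every chunk border (the failure mode);
 * with keep_state != 0 it is carried over to the next chunk.
 */
static int find_in_chunks(const char **chunks, size_t count,
                          const char *needle, int keep_state)
{
    EndTagScanner s = { needle, 0 };

    for (size_t i = 0; i < count; i++) {
        if (!keep_state)
            s.matched = 0;
        if (scanner_feed(&s, chunks[i], strlen(chunks[i])))
            return 1;
    }
    return 0;
}
```

With the two chunks "<script>var s = 1;</scr" and "ipt><p>text</p>",
find_in_chunks() finds "</script>" only when keep_state is non-zero. If
the same input arrives as a single chunk, both variants find it, which is
why the result depends on where the chunk border happens to fall.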

with '</ style' or '</foo>' and expect that to close the open tag, and
'style "</" style' and expect that not to close it...


I see. I guess there's a reason why the slash in "</" should be quoted
in CDATA content when it is not the real end tag ;-) However, there's a
lot of HTML in the wild containing unquoted "</" strings in CDATA
blocks.

You can try, but it's all very messy IMHO, I will take 
patches if not obviously broken


I will look into it further and get back to the list if I'm able to
produce anything useful.

Thanks,

Cyrill
