Re: [xml] Patch to improve HTMLparser's robustness

On Tue, Apr 22, 2008 at 12:18:20PM -0400, Daniel Veillard wrote:
On Tue, Apr 22, 2008 at 03:56:33PM +0200, Arnold Hendriks wrote:
Daniel Veillard wrote:
 I think the embedding error condition should be noted somewhere in the 
parser state and disable at least partially the closing tag processing so
that the 'end text' paragraph shows up as a sibling of the 'embbeded text'
It probably should generate an error, yes. My patch simply ignores the 

  but break the normal cases, which is not acceptable, nice try ;-)

  Here is a proper patch. It reuses ctxt->depth, which the HTML parser does
not use yet, to count the number of times an opening tag has been ignored,
and then uses that count to drop the matching closing tags. Of course extra
or missing ending tags are still possible, but at this point one can only
apply heuristics. It works properly for me; I will commit soonish unless I
hear a good reason against it in the meantime:

wei:~/XML -> ./xmllint --html autoskip.html
autoskip.html:3: HTML parser error : htmlParseStartTag: misplaced <html> tag
<html xml:lang="en" xmlns="foobar">
autoskip.html:4: HTML parser error : htmlParseStartTag: misplaced <body> tag
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "";>
<p>some text

<p>embbeded text</p>

<p>end text
wei:~/XML ->
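
For readers without the attachment, the counting heuristic described above
can be sketched roughly as follows. This is not the code from autoskip.patch:
SketchCtxt, sketch_start_tag and sketch_end_tag are hypothetical names made
up for illustration, and the sketch assumes that only <html> and <body> are
treated as "misplaced" when they show up a second time. The one thing taken
from the mail is the idea of a depth counter that goes up when an opening
tag is ignored and goes down when a closing tag is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the parser context; only the counter matters. */
typedef struct {
    int  depth;      /* number of opening tags ignored so far */
    bool seen_html;  /* a first <html> has already been accepted */
    bool seen_body;  /* a first <body> has already been accepted */
} SketchCtxt;

/* Returns true if the opening tag is kept, false if it is ignored. */
static bool sketch_start_tag(SketchCtxt *ctxt, const char *name)
{
    bool *seen = NULL;

    assert(ctxt != NULL && name != NULL);
    if (strcmp(name, "html") == 0)
        seen = &ctxt->seen_html;
    else if (strcmp(name, "body") == 0)
        seen = &ctxt->seen_body;

    if (seen == NULL)
        return true;          /* ordinary element, nothing special */
    if (!*seen) {
        *seen = true;         /* first occurrence is the normal case */
        return true;
    }
    ctxt->depth++;            /* misplaced repeat: ignore it, remember that */
    return false;
}

/* Returns true if the closing tag is processed, false if it is dropped. */
static bool sketch_end_tag(SketchCtxt *ctxt, const char *name)
{
    bool structural;

    assert(ctxt != NULL && name != NULL);
    structural = strcmp(name, "html") == 0 || strcmp(name, "body") == 0;
    if (structural && ctxt->depth > 0) {
        ctxt->depth--;        /* pairs up with an ignored opening tag */
        return false;
    }
    return true;
}
```

With a zero-initialised SketchCtxt, a second <html> and a second <body> are
both ignored and depth reaches 2; the next </body> and </html> are then
dropped and depth returns to 0, so the outer closing tags are still handled
normally. As said above, this is only a heuristic: the counter does not
record which tag was ignored, so unbalanced input can still be mis-paired.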


Red Hat Virtualization group
Daniel Veillard      | virtualization library
veillard redhat com  | libxml GNOME XML XSLT toolkit | Rpmfind RPM search engine

Attachment: autoskip.patch
Description: Text document
