Re: [xml] Patch to improve HTMLparser's robustness



Daniel Veillard wrote:
I didn't forget about the issue, and got a bit of time to test yesterday
and look at it. First, the patch makes sense: it fixes a serious problem
and there is no leak, which is fine, but the result is still problematic:


</body>
</html>
<p>end text
</body></html>


  Basically the error is correctly displayed, but closing the embedded
body and html tags generates a serious mess. We are able to detect the embedding,
but the autoclose misbehaves. Moreover, if using the push parser, the
autoclose ends the document immediately:
Can I cheat? :) Given that nothing should appear between </body> and </html>, and </html> is always the last tag, it's easiest to just ignore them and let the autoclose deal with it...

vz202:~/libxml2/trunk # svn diff HTMLparser.c
Index: HTMLparser.c
===================================================================
--- HTMLparser.c        (revision 3739)
+++ HTMLparser.c        (working copy)
@@ -3646,7 +3646,9 @@
    SKIP(2);

    name = htmlParseHTMLName(ctxt);
-    if (name == NULL)
+    if (name == NULL
+       || xmlStrEqual(name, BAD_CAST "html")
+       || xmlStrEqual(name, BAD_CAST "body") )
        return (0);

    /*
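
Taken out of the parser, the check the patch adds to the end-tag handling amounts to roughly the following. This is only a standalone sketch: the real code calls xmlStrEqual() on xmlChar strings inside HTMLparser.c, should_skip_end_tag() is a made-up name, and it assumes the tag name has already been lower-cased by the time it is compared.

```c
#include <stddef.h>
#include <string.h>

/*
 * Standalone illustration only; should_skip_end_tag() is a hypothetical
 * helper and not part of libxml2.
 *
 * Returns 1 when the end tag should be ignored, 0 otherwise.
 */
static int should_skip_end_tag(const char *name)
{
    if (name == NULL)
        return 1;               /* no tag name could be parsed */
    if (strcmp(name, "html") == 0 || strcmp(name, "body") == 0)
        return 1;               /* leave these to the autoclose */
    return 0;
}
```

Any other end tag (for example "p") still goes through the normal closing logic.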


With this patch, I get:

<html xml:lang="en" xmlns="foobar">
    ^
autoskip.html:4: HTML parser error : htmlParseStartTag: misplaced <body> tag
<body>
    ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>some text

</p>
<p>embbeded text</p>
&gt;
&gt;
<p>end text
&gt;&gt;
</p>
</body></html>

Which looks good enough to me. It's probably at least enough to get it properly through my HTML email sanitizer.


I think the embedding error condition should be noted somewhere in the
parser state and should disable, at least partially, the closing tag
processing, so that the 'end text' paragraph shows up as a sibling of the
'embbeded text' paragraph.
It probably should generate an error, yes. My patch simply ignores the situation.
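
For comparison, noting the condition in the parser state and reporting an error could look something like the sketch below. Everything in it is hypothetical: struct sketch_state and its fields are not libxml2's htmlParserCtxt, and the error counter merely stands in for a real error report. It only shows the idea of setting a flag when a second <html> or <body> turns up inside the body and then refusing to close the outer elements on the matching end tags.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical parser state; not libxml2's htmlParserCtxt. */
struct sketch_state {
    int body_open;      /* a <body> start tag has been seen */
    int embedded;       /* a second <html>/<body> was seen inside the body */
    int errors;         /* number of errors reported so far */
};

static int is_html_or_body(const char *name)
{
    return name != NULL &&
           (strcmp(name, "html") == 0 || strcmp(name, "body") == 0);
}

/* Returns 1 if the start tag should be processed, 0 if it is dropped. */
static int sketch_start_tag(struct sketch_state *st, const char *name)
{
    if (is_html_or_body(name) && st->body_open) {
        st->embedded = 1;   /* note the embedding condition */
        st->errors++;       /* stands in for the "misplaced tag" error */
        return 0;
    }
    if (name != NULL && strcmp(name, "body") == 0)
        st->body_open = 1;
    return 1;
}

/* Returns 1 if the end tag should be processed, 0 if it is ignored. */
static int sketch_end_tag(struct sketch_state *st, const char *name)
{
    if (is_html_or_body(name) && st->embedded) {
        st->errors++;       /* report instead of silently skipping */
        return 0;           /* keep the outer body open */
    }
    return 1;
}
```

With the flag set, the </body> and </html> of the embedded document are dropped, so a following <p> stays inside the outer body; the final </body></html> is dropped too, and closing the document is left to the autoclose at end of input.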

--
Arnold Hendriks <a hendriks b-lex nl>
B-Lex Information Technologies <http://www.b-lex.com/>
Postbus 545, 7500 AM Enschede, The Netherlands

B-Lex: +31 (0)53 4836543
Mobile: +31 (0)6 51710159
MSN: a hendriks b-lex nl
ICQ: 86313731



