Re: [xml] Patch to improve HTMLparser's robustness
- From: Daniel Veillard <veillard redhat com>
- To: Arnold Hendriks <a hendriks b-lex nl>
- Cc: xml gnome org
- Subject: Re: [xml] Patch to improve HTMLparser's robustness
- Date: Thu, 13 Mar 2008 03:21:44 -0400
On Wed, Dec 12, 2007 at 07:53:10PM +0100, Arnold Hendriks wrote:
> I've been running into problems parsing incoming email messages through
> libxml2's HTML parser, which when seeing tags such as <html
> xml:lang="en" xmlns:....> in an unexpected place, will just eat the
> '<html' part and turn the attributes of that html tag into normal text,
> causing odd code to appear at the top of email messages. This mostly
> affects Outlook/Exchange generated messages.
> The attached patch tries to fix it. It works for me, but I wonder
> whether I haven't introduced memory allocation issues with it, and hope
> the patch (or a similar solution) can be integrated into a future libxml
> release.
Hi Arnold,
I didn't forget about the issue, and got a bit of time yesterday to test
and look at it. First, the patch makes sense: it fixes a serious problem
and there is no leak, that's fine. But the result is still problematic:
laptop:~/XML -> cat autoskip.html
<html><body>
<p>some text
<html xml:lang="en" xmlns="foobar">
<body>
<p>embbeded text</p>
</body>
</html>
<p>end text
</body></html>
laptop:~/XML -> xmllint --html autoskip.html
autoskip.html:3: HTML parser error : htmlParseStartTag: misplaced <html> tag
<html xml:lang="en" xmlns="foobar">
^
autoskip.html:4: HTML parser error : htmlParseStartTag: misplaced <body> tag
<body>
^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p>some text
</p>
<p>embbeded text</p>
</body>
<html><body><p>end text
</p></body></html>
</html>
laptop:~/XML ->
Basically the error is correctly displayed, but the closing of the embedded
body and html tags generates a serious mess. We are able to detect the
embedding, but the autoclose misbehaves. Moreover, when using the push
parser, the autoclose ends the document immediately:
laptop:~/XML -> xmllint --html --push autoskip.html
autoskip.html:3: HTML parser error : htmlParseStartTag: misplaced <html> tag
<html xml:lang="en" xmlns="foobar">
^
autoskip.html:4: HTML parser error : htmlParseStartTag: misplaced <body> tag
<body>
^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>some text
</p>
<p>embbeded text</p>
</body></html>
laptop:~/XML ->
I think the embedding error condition should be noted somewhere in the
parser state and should disable, at least partially, the closing tag
processing so that the 'end text' paragraph shows up as a sibling of the
'embbeded text' paragraph.
That, or we show the full subdocument structure, but I don't feel the
current processing is good, even if it's clearly better with your patch than
without.
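To make the first option concrete, here is a minimal sketch of the idea,
written as a standalone toy and not as libxml2 code. The names embed_state,
on_start_tag and on_end_tag are hypothetical and do not exist in
HTMLparser.c. It assumes tag names arrive already lowercased. The state
counts the misplaced <html> and <body> start tags that were skipped, and the
matching end tags are then ignored instead of triggering the autoclose:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical state, NOT a libxml2 structure: illustration only. */
typedef struct {
    int html_open;     /* 1 once the real <html> has been opened */
    int body_open;     /* 1 once the real <body> has been opened */
    int skipped_html;  /* misplaced <html> starts still awaiting their end tag */
    int skipped_body;  /* misplaced <body> starts still awaiting their end tag */
} embed_state;

typedef enum { TAG_PROCESS, TAG_IGNORE } tag_action;

/* Decide what to do with a start tag; name is assumed to be lowercase. */
tag_action on_start_tag(embed_state *st, const char *name)
{
    assert(st != NULL && name != NULL);
    if (strcmp(name, "html") == 0) {
        if (st->html_open) {          /* misplaced: remember it and skip it */
            st->skipped_html++;
            return TAG_IGNORE;
        }
        st->html_open = 1;
    } else if (strcmp(name, "body") == 0) {
        if (st->body_open) {          /* misplaced: remember it and skip it */
            st->skipped_body++;
            return TAG_IGNORE;
        }
        st->body_open = 1;
    }
    return TAG_PROCESS;
}

/* Decide what to do with an end tag; name is assumed to be lowercase. */
tag_action on_end_tag(embed_state *st, const char *name)
{
    assert(st != NULL && name != NULL);
    if (strcmp(name, "html") == 0) {
        if (st->skipped_html > 0) {   /* closes a skipped start: no autoclose */
            st->skipped_html--;
            return TAG_IGNORE;
        }
        st->html_open = 0;
    } else if (strcmp(name, "body") == 0) {
        if (st->skipped_body > 0) {   /* closes a skipped start: no autoclose */
            st->skipped_body--;
            return TAG_IGNORE;
        }
        st->body_open = 0;
    }
    return TAG_PROCESS;
}
```

Fed the tag sequence of autoskip.html, this ignores the embedded <html> and
<body> together with their end tags, so the outer body stays open and the
'end text' paragraph would land next to the 'embbeded text' one. How this
would map onto the real parser context is left open here.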
I committed your patch, but there is still some cleanup remaining if you
want to look at it.
Thanks !
Daniel
--
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard | virtualization library http://libvirt.org/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/