Re: [xml] Patch to improve HTMLparser's robustness



On Wed, Dec 12, 2007 at 07:53:10PM +0100, Arnold Hendriks wrote:
I've been running into problems parsing incoming email messages through 
libxml2's HTML parser, which when seeing tags such as <html 
xml:lang="en" xmlns:....> in an unexpected place, will just eat the 
'<html' part and turn the attributes of that html tag into normal text, 
causing odd code to appear at the top of email messages. This mostly 
affects Outlook/Exchange generated messages.

The attached patch tries to fix it. It works for me, but I wonder 
whether I haven't introduced memory allocation issues with it, and hope 
the patch (or a similar solution) can be integrated into a future libxml 
release.

  Hi Arnold,

I didn't forget about the issue, and got a bit of time yesterday to test the
patch and look at it. First, the patch makes sense: it fixes a serious problem
and there is no leak, which is fine, but the result is still problematic:

laptop:~/XML -> cat autoskip.html
<html><body>
<p>some text
<html xml:lang="en" xmlns="foobar">
<body>
<p>embbeded text</p>
</body>
</html>
<p>end text
</body></html>
laptop:~/XML -> xmllint --html autoskip.html
autoskip.html:3: HTML parser error : htmlParseStartTag: misplaced <html> tag
<html xml:lang="en" xmlns="foobar">
     ^
autoskip.html:4: HTML parser error : htmlParseStartTag: misplaced <body> tag
<body>
     ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p>some text

</p>
<p>embbeded text</p>
</body>
<html><body><p>end text
</p></body></html>
</html>
laptop:~/XML ->

  Basically the error is correctly reported, but the closing of the embedded
body and html tags generates a serious mess. We are able to detect the embedding,
but the autoclose misbehaves. Moreover, when using the push parser, the
autoclose ends the document immediately:

laptop:~/XML -> xmllint --html --push autoskip.html
autoskip.html:3: HTML parser error : htmlParseStartTag: misplaced <html> tag
<html xml:lang="en" xmlns="foobar">
     ^
autoskip.html:4: HTML parser error : htmlParseStartTag: misplaced <body> tag
<body>
     ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>some text

</p>
<p>embbeded text</p>
</body></html>
laptop:~/XML -> 

  I think the embedding error condition should be recorded somewhere in the
parser state and should disable, at least partially, the closing tag processing,
so that the 'end text' paragraph shows up as a sibling of the 'embbeded text'
paragraph.
  Either that, or we show the full subdocument structure, but I don't feel the
current processing is good, even if it's clearly better with your patch than
without.
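
  To make the first option (recording the misplaced tags in the parser state)
a bit more concrete, here is a minimal sketch in C. It is a toy model working
on a pre-split tag stream, not libxml2 code: ToyState, ToyTag, toy_start_tag,
toy_end_tag and toy_filter are hypothetical names invented for illustration,
and the sketch only tracks html and body, with no autoclose of p or any other
element.

```c
#include <stdio.h>
#include <string.h>

/* Toy parser state (hypothetical, not libxml2's htmlParserCtxt). */
typedef struct {
    int html_open;  /* a legitimate <html> is currently open */
    int body_open;  /* a legitimate <body> is currently open */
    int skip_html;  /* misplaced <html> tags whose close must be ignored */
    int skip_body;  /* misplaced <body> tags whose close must be ignored */
} ToyState;

typedef struct {
    int is_end;        /* 0 for a start tag, 1 for an end tag */
    const char *name;  /* lowercase element name, e.g. "body" */
} ToyTag;

/* Returns 1 if the start tag should be kept, 0 if it is misplaced. */
static int toy_start_tag(ToyState *st, const char *name)
{
    if (strcmp(name, "html") == 0) {
        if (st->html_open) { st->skip_html++; return 0; }
        st->html_open = 1;
    } else if (strcmp(name, "body") == 0) {
        if (st->body_open) { st->skip_body++; return 0; }
        st->body_open = 1;
    }
    return 1;
}

/* Returns 1 if the end tag should be processed, 0 if it belongs to a
 * misplaced start tag recorded earlier and must be ignored. */
static int toy_end_tag(ToyState *st, const char *name)
{
    if (strcmp(name, "html") == 0) {
        if (st->skip_html > 0) { st->skip_html--; return 0; }
        st->html_open = 0;
    } else if (strcmp(name, "body") == 0) {
        if (st->skip_body > 0) { st->skip_body--; return 0; }
        st->body_open = 0;
    }
    return 1;
}

/* Writes the kept tags to out (truncating if needed) and returns how
 * many tags were kept. */
size_t toy_filter(const ToyTag *tags, size_t n, char *out, size_t outsz)
{
    ToyState st = {0, 0, 0, 0};
    size_t kept = 0;
    size_t i;

    if (outsz == 0)
        return 0;
    out[0] = '\0';
    for (i = 0; i < n; i++) {
        int keep = tags[i].is_end ? toy_end_tag(&st, tags[i].name)
                                  : toy_start_tag(&st, tags[i].name);
        if (keep) {
            size_t len = strlen(out);
            snprintf(out + len, outsz - len, "%s%s>",
                     tags[i].is_end ? "</" : "<", tags[i].name);
            kept++;
        }
    }
    return kept;
}
```

  Fed the tag sequence of autoskip.html, this drops the embedded <html> and
<body> together with their closing tags, so the 'end text' paragraph stays
inside the original body. A real fix would of course have to live in the
existing start and end tag handling and cope with autoclose as well.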
  I committed your patch, but there is still some cleanup remaining if you
want to look at it.

  Thanks !

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/


