Re: [xml] [PATCH] less-than character and HTML parser module

From: Daniel Veillard <veillard redhat com>
To: Christian Schoenebeck <schoenebeck crudebyte com>
Cc: xml gnome org
Subject: Re: [xml] [PATCH] less-than character and HTML parser module
Date: Tue, 30 Jun 2015 11:43:34 +0800

On Thu, Apr 16, 2015 at 04:32:32PM +0800, Daniel Veillard wrote:

On Tue, Apr 14, 2015 at 05:43:42PM +0200, Christian Schoenebeck wrote:

On Tuesday 14 April 2015 17:50:51 you wrote:

If anything like this does get put in, it should only be if it is a
configurable option that is disabled by default - an xml parser should
only accept a strictly-conforming document by default. Adding support for
‘broken’ html because other (weak) parsers allow it is not a good plan as
it causes divergence from the standard.


There you go; you will find the updated patch attached. It now requires the
HTML_PARSE_RECOVER option to be set for recovering from stand-alone less-than
characters.


That sounds fine *except* it doesn't raise an error.
The parser knows it's a broken construct that must be pointed out.

thinkpad2:~/XML -> ./xmllint --html tst.html
tst.html:3: HTML parser error : htmlParseStartTag: invalid element name
<p> blah < booh </p>
          ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p> blah 
</p>
</body>
</html>
thinkpad2:~/XML -> ./xmllint --html --recover tst.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p> blah &lt; booh </p>
</body>
</html>
thinkpad2:~/XML -> 

 The fact that we worked around a broken start tag construct must be reported.
Whether we do that with the recovery option or not is less important IMHO.

 It sounds a bit weird to handle that error case as one of the main content
cases. I would still be tempted to go into htmlParseStartTag, get the
error reported, but push corrective data instead in recover mode.

 Can we get a v3 ? :-)

  thanks

Daniel


  Okay, I did it.
It does what you expect: it doesn't rewind the input, it doesn't
modify the main content loop routine, and it raises the same error
message as when processed in non-recovery mode:

 https://git.gnome.org/browse/libxml2/commit/?id=140c251e8e5653572edcca91b9d675f871735cb4

thinkpad:~/XML -> cat tst.html
<body>
<p>  a <b </p>
<p>  a < b </p>
<p> a < b> </p>
<p> a <0 </p>
<p> a <=0 </p>
</body>
thinkpad:~/XML -> ./xmllint --html --recover tst.html
tst.html:2: HTML parser error : error parsing attribute name
<p>  a <b </p>
          ^
tst.html:3: HTML parser error : htmlParseStartTag: invalid element name
<p>  a < b </p>
        ^
tst.html:4: HTML parser error : htmlParseStartTag: invalid element name
<p> a < b> </p>
       ^
tst.html:5: HTML parser error : htmlParseStartTag: invalid element name
<p> a <0 </p>
       ^
tst.html:6: HTML parser error : htmlParseStartTag: invalid element name
<p> a <=0 </p>
       ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p>  a <b>
</b></p>
<p>  a &lt; b </p>
<p> a &lt; b&gt; </p>
<p> a &lt;0 </p>
<p> a &lt;=0 </p>
</body>
</html>
thinkpad:~/XML -> 
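
For readers who want to play with the idea outside libxml2, here is a
minimal Python sketch of the "report the error, then push corrective data"
behaviour discussed in this thread. It is a toy and not the code from the
commit: the function names (is_markup_start, escape_stray_lt) and the
simplified rule for what may follow a '<' are assumptions made for
illustration only.

```python
def is_markup_start(text, pos):
    """Toy rule (an assumption, not libxml2's): the '<' at text[pos] opens
    markup if it is followed by '!', '?', or an optional '/' and then a
    letter, '_' or ':'."""
    nxt = text[pos + 1 : pos + 3]
    if nxt[:1] in ("!", "?"):
        return True
    if nxt[:1] == "/":
        nxt = nxt[1:]
    return bool(nxt) and (nxt[0].isalpha() or nxt[0] in "_:")


def escape_stray_lt(text):
    """Return (recovered_text, error_offsets).

    Every '<' that cannot start markup is recorded as an error (by its
    0-based offset) and then emitted as '&lt;', so the problem is still
    reported even though the output is repaired.
    """
    out, errors = [], []
    for pos, ch in enumerate(text):
        if ch == "<" and not is_markup_start(text, pos):
            errors.append(pos)      # always report the broken construct
            out.append("&lt;")      # then push the corrective data
        else:
            out.append(ch)
    return "".join(out), errors


if __name__ == "__main__":
    for line in ["<p>  a < b </p>", "<p> a <0 </p>", "<p> a <=0 </p>"]:
        fixed, errs = escape_stray_lt(line)
        print(fixed, errs)
```

The sketch only covers the stray '<' cases. Unlike the real parser and
serializer shown above, it does not escape a later '>' in text, and it does
not diagnose an unterminated tag such as "<b </p>".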

  thanks for raising the issue and the initial patches !

Daniel

-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/
