Re: [xml] [PATCH] less-than character and HTML parser module



On Tue, Apr 14, 2015 at 04:50:51PM +0100, Chris Tapp wrote:

On 14 Apr 2015, at 15:24, Christian Schoenebeck <schoenebeck crudebyte com> wrote:

On Tuesday 14 April 2015 09:31:25 Alex Bligh wrote:
On 13 Apr 2015, at 22:43, Christian Schoenebeck <schoenebeck crudebyte com> wrote:
I just encountered an issue with stand-alone less-than characters when a
document is parsed by libxml2's HTML parser module. Suppose your HTML
document contains text like:
  a < b

The less-than sign in this case is interpreted by the HTML parser module
as the start of a tag, causing the subsequent text (in this case "< b")
to be dropped.

Isn't that correct? Shouldn't your document have

    a &lt; b

If it were a well-formed HTML document, then yes. But as I said, in reality
there are plenty of HTML documents that contain raw less-than characters in
text, which is encouraged by the fact that all major browsers handle them.
libxml2's HTML parser is an exception here.
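
To make the lenient behaviour concrete, here is a minimal sketch of a
pre-processing workaround. It is not the attached patch and not libxml2 code.
It assumes the tag-open rule of the HTML5 tokenizer: a "<" starts markup only
when followed by an ASCII letter, "/", "!" or "?". The helper name
escape_stray_lt is hypothetical, and the sketch ignores special contexts such
as script elements and comments.

```python
import re

# A "<" only opens markup when followed by an ASCII letter, "/", "!" or "?".
# Assumption: this mirrors the HTML5 tokenizer's tag-open rule; it is not
# taken from libxml2 or from the patch discussed in this thread.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(html: str) -> str:
    """Replace stand-alone less-than characters with the &lt; entity."""
    return _STRAY_LT.sub("&lt;", html)


if __name__ == "__main__":
    print(escape_stray_lt("<p>a < b</p>"))
```

Running the escaped text through a strict parser then keeps "< b" as text.
Note the limitation: input such as "a <b" is still treated as a tag start,
since the "<" is followed directly by a letter.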

Attached is a patch suggesting a fix for this issue.

If anything like this does get put in, it should only be as a
configurable option that is disabled by default - an XML parser should
only accept a strictly-conforming document by default. Adding support
for ‘broken’ HTML because other (weak) parsers allow it is not a
good plan as it causes divergence from the standard.

  It's not the XML parser that is modified, it's the HTML 'lax' one.
The problem is that there are already way too many parser options IMHO.

Daniel

Chris Tapp
opensource keylevel com
www.keylevel.com

----
You can tell you're getting older when your car insurance gets real cheap!




_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml gnome org
https://mail.gnome.org/mailman/listinfo/xml


-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/

