Re: [xml] [PATCH] less-than character and HTML parser module

I just encountered an issue with stand-alone less-than characters when the
document is parsed by libxml2's HTML parser module. Consider text in your
HTML document like:
  a < b

The less-than sign in this case is interpreted by the HTML parser module
as the start of a tag, causing subsequent text (in this case "< b") to be

Isn't that correct? Shouldn't your document have

    a &lt; b

If it were a well-formed HTML document, then yes. But as I said, in reality
there are a lot of HTML documents which contain text with raw less-than
characters, encouraged by the fact that all major HTML browsers can handle
them. libxml's HTML parser is so far an exception here.

Attached is a patch suggesting a fix for this issue.
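
For readers without the attachment, here is a minimal sketch of the kind of
heuristic being discussed. It is not the attached patch and it does not touch
libxml2 at all: it is a stand-alone pre-processing pass that rewrites a bare
"<" as "&lt;" before the text is handed to any parser. The function names
(lt_starts_markup, escape_bare_lt) are made up for this illustration, and the
rule used (a "<" only opens markup when followed by an ASCII letter, "/", "!"
or "?") is an assumption modelled on how browsers tokenize HTML.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns nonzero if the '<' at p[0] looks like real
 * markup, i.e. a start tag (<b), an end tag (</b), a comment or doctype
 * (<!) or a processing instruction (<?). Anything else is literal text. */
static int lt_starts_markup(const char *p)
{
    unsigned char next = (unsigned char)p[1];
    return isalpha(next) || next == '/' || next == '!' || next == '?';
}

/* Hypothetical helper: copies `in` to `out`, replacing each bare '<' with
 * "&lt;". Returns the number of bytes the full result needs, excluding the
 * terminating NUL. If `out` is too small, the output is cut at the last
 * piece that fit and is still NUL-terminated. */
static size_t escape_bare_lt(const char *in, char *out, size_t out_size)
{
    size_t need = 0;  /* bytes the complete result requires */
    size_t used = 0;  /* bytes actually written to out */

    assert(in != NULL);
    for (const char *p = in; *p != '\0'; p++) {
        const char *piece = p;
        size_t len = 1;

        if (*p == '<' && !lt_starts_markup(p)) {
            piece = "&lt;";
            len = 4;
        }
        /* Write only while nothing has been skipped and room remains
         * for this piece plus the terminating NUL. */
        if (out != NULL && need == used && used + len < out_size) {
            memcpy(out + used, piece, len);
            used += len;
        }
        need += len;
    }
    if (out != NULL && out_size > 0)
        out[used] = '\0';
    return need;
}
```

Calling escape_bare_lt("a < b", buf, sizeof buf) with a large enough buffer
leaves "a &lt; b" in buf, while real tags such as <p> and </p> pass through
unchanged. The heuristic has an obvious limit: input like "a<b" (no space)
still looks like the start of a tag and is left alone.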

If anything like this does get put in, it should only be as a
configurable option that is disabled by default - an XML parser should
only accept a strictly-conforming document by default. Adding support
for ‘broken’ HTML because other (weak) parsers allow it is not a
good plan as it causes divergence from the standard.

  It's not the XML parser which is modified, it's the HTML 'lax' one.
The problem is that there are already way too many parser options IMHO.


