Re: [xml] [PATCH] less-than character and HTML parser module




On 14 Apr 2015, at 15:24, Christian Schoenebeck <schoenebeck crudebyte com> wrote:

On Tuesday 14 April 2015 09:31:25 Alex Bligh wrote:
On 13 Apr 2015, at 22:43, Christian Schoenebeck <schoenebeck crudebyte com> wrote:
I just encountered an issue with stand-alone less-than characters when a
document is parsed by libxml2's HTML parser module. Suppose your HTML
document contains text like:
    a < b

The less-than sign in this case is interpreted by the HTML parser module
as the start of a tag, causing the subsequent text (in this case "< b")
to be dropped.

Isn't that correct? Shouldn't your document have

    a &lt; b
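
For illustration, here is a minimal sketch of producing the escaped form on
the authoring side, so that a strict parser never sees a raw less-than sign.
It uses Python's standard html module rather than libxml2, and the helper
name escape_text is hypothetical, not part of any library discussed here.

```python
import html


def escape_text(text: str) -> str:
    """Escape &, < and > so the string is safe as HTML text content.

    Hypothetical helper for illustration only.
    """
    # quote=False leaves quotation marks alone; they only need escaping
    # inside attribute values, not in element text.
    return html.escape(text, quote=False)


if __name__ == "__main__":
    escaped = escape_text("a < b")
    print(escaped)                  # a &lt; b
    print(html.unescape(escaped))   # a < b
```

This only helps for documents you generate yourself; it does nothing for
existing documents that already contain raw less-than characters.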

If it were a well-formed HTML document, then yes. But as I said, in practice
there are many HTML documents whose text contains raw less-than characters,
a habit sustained by the fact that all major browsers handle them. libxml2's
HTML parser is an exception here.

Attached is a patch suggesting a fix for this issue.

If anything like this does get put in, it should only be as a configurable
option that is disabled by default: an XML parser should accept only
strictly conforming documents by default. Adding support for ‘broken’ HTML
because other (weak) parsers allow it is not a good plan, as it causes
divergence from the standard.

--

Chris Tapp
opensource keylevel com
www.keylevel.com

----
You can tell you're getting older when your car insurance gets real cheap!



