[xml] [PATCH] less-than character and HTML parser module



On Tuesday 14 April 2015 09:31:25 Alex Bligh wrote:
On 13 Apr 2015, at 22:43, Christian Schoenebeck <schoenebeck crudebyte com> wrote:
I just encountered an issue with stand-alone less-than characters when a
document is parsed by libxml2's HTML parser module. Suppose your HTML
document contains text like:
    a < b

The less-than sign in this case is interpreted by the HTML parser module
as the start of a tag, so the subsequent text (in this case "< b") is
dropped.

Isn't that correct? Shouldn't your document have

     a &lt; b

If it were a well-formed HTML document, then yes. But as I said, in practice 
there are many HTML documents that contain text with raw less-than characters, 
and all major HTML browsers handle them. libxml2's HTML parser is an 
exception here.
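
To illustrate the browser behaviour referred to above: in the HTML tokenizer rules, a '<' only opens markup when it is followed by an ASCII letter, '/', '!' or '?'; otherwise it is emitted as plain text. The following is a minimal sketch of that rule as a stand-alone pre-processing step. It is not the attached patch and not part of libxml2; the function names `looks_like_markup` and `escape_stray_lt` are hypothetical, and the sketch deliberately ignores contexts such as comments and script content, where '<' must be left alone.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a libxml2 function): returns nonzero if the
 * '<' at s[0] is followed by a character that can begin markup.
 * Assumes the default "C" locale, where isalpha() matches ASCII letters. */
static int looks_like_markup(const char *s)
{
    unsigned char next = (unsigned char)s[1];  /* '\0' if '<' is last */
    return isalpha(next) || next == '/' || next == '!' || next == '?';
}

/* Hypothetical helper: returns a newly allocated copy of `in` in which
 * every stand-alone '<' is replaced by "&lt;". The caller must free()
 * the result. Returns NULL if allocation fails. */
char *escape_stray_lt(const char *in)
{
    size_t len = strlen(in);
    size_t i, j = 0;
    char *out = malloc(len * 4 + 1);  /* worst case: every char is '<' */

    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        if (in[i] == '<' && !looks_like_markup(&in[i])) {
            memcpy(out + j, "&lt;", 4);
            j += 4;
        } else {
            out[j++] = in[i];
        }
    }
    out[j] = '\0';
    return out;
}
```

With this sketch, the input "a < b" would become "a &lt; b" before it reaches the parser, while "<p>" and "</p>" pass through unchanged. Fixing the parser itself avoids this extra pass over the input.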

Attached is a patch that suggests a fix for this issue.

Best regards,
Christian Schoenebeck

Attachment: libxml2-less-than-char.patch
Description: Text Data


