On Tuesday 14 April 2015 09:31:25 Alex Bligh wrote:
> On 13 Apr 2015, at 22:43, Christian Schoenebeck <schoenebeck crudebyte com>
> wrote:
>
>> I just encountered an issue with stand-alone less-than characters if the
>> document is parsed by libxml2's HTML parser module. Consider you have
>> text in your HTML document like:
>>
>>     a < b
>>
>> The less-than sign in this case is interpreted by the HTML parser module
>> as a tag start, causing the subsequent text (in this case "< b") to be
>> dropped.
>
> Isn't that correct? Shouldn't your document have
>
>     a &lt; b
If it were a well-formed HTML document, then yes. But as I said, in reality there are plenty of HTML documents that contain text with raw less-than characters, and all major browsers handle them. libxml2's HTML parser is still an exception here.

Attached is a patch suggesting a fix for this issue.

Best regards,
Christian Schoenebeck
Attachment:
libxml2-less-than-char.patch
Description: Text Data
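
For readers without the attachment, here is a rough sketch of the kind of check browsers apply. It is not the content of the patch and it does not use any libxml2 API. The function names lt_starts_markup and escape_stray_lt are hypothetical, and the rule is an assumption modelled on the HTML5 tokenizer, where a '<' only opens markup when it is followed by an ASCII letter, '/', '!' or '?'. Anything else, such as the "< b" above, is kept as literal text.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libxml2.
 * Returns 1 if the '<' at p[0] would open markup, 0 if it is literal text. */
int lt_starts_markup(const char *p)
{
    char next;

    if (p == NULL || p[0] != '<')
        return 0;
    next = p[1];
    if ((next >= 'a' && next <= 'z') || (next >= 'A' && next <= 'Z'))
        return 1;                       /* start tag, e.g. "<b>"           */
    if (next == '/' || next == '!' || next == '?')
        return 1;                       /* end tag, comment/doctype, or PI */
    return 0;                           /* stray '<', e.g. "a < b"         */
}

/* Hypothetical pre-processing step: copy `in` to `out`, rewriting every
 * stray '<' as "&lt;" so that a strict parser keeps the following text.
 * Returns 0 on success, -1 if `out` (of size outsz) is too small. */
int escape_stray_lt(const char *in, char *out, size_t outsz)
{
    size_t o = 0;

    for (; *in != '\0'; in++) {
        if (*in == '<' && !lt_starts_markup(in)) {
            if (o + 4 >= outsz)         /* 4 bytes plus room for the NUL   */
                return -1;
            memcpy(out + o, "&lt;", 4);
            o += 4;
        } else {
            if (o + 1 >= outsz)         /* 1 byte plus room for the NUL    */
                return -1;
            out[o++] = *in;
        }
    }
    if (outsz == 0)
        return -1;
    out[o] = '\0';
    return 0;
}
```

Passing "<p>a < b</p>" through escape_stray_lt yields "<p>a &lt; b</p>", which is the well-formed spelling Alex suggested. The sketch does not track context, so a stray '<' inside a comment, a script element or an attribute value would be escaped as well.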