On 14 Apr 2015, at 15:24, Christian Schoenebeck <schoenebeck crudebyte com> wrote:

> On Tuesday 14 April 2015 09:31:25 Alex Bligh wrote:
>> On 13 Apr 2015, at 22:43, Christian Schoenebeck <schoenebeck crudebyte com> wrote:
>>> I just encountered an issue with stand-alone less-than characters if
>>> the document is parsed by libxml2's HTML parser module. Consider you
>>> have a text in your HTML document like:
>>>
>>>     a < b
>>>
>>> The less-than sign in this case is interpreted by the HTML parser
>>> module as tag start, causing subsequent text (in this case "< b") to
>>> be dropped.
>>
>> Isn't that correct? Shouldn't your document have
>>
>>     a &lt; b
>
> If it was a well-formed HTML document, then yes. But as said, in reality
> there are a load of HTML documents which contain text with raw less-than
> characters, supported by the fact that all major HTML browsers can
> handle it. libxml's HTML parser is yet an exception here.
>
> Attached you find a patch, suggesting a fix for this issue.
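
For reference, the conforming form mentioned in the quoted exchange can be produced by escaping text before it is written into the document. What follows is only a minimal sketch, assuming the generator is written in Python and uses the standard-library `html` module; neither is part of the thread or of libxml2, and the helper name `escape_text` is made up for illustration.

```python
import html


def escape_text(text: str) -> str:
    """Escape &, < and > so the text is safe as HTML character data.

    Hypothetical helper for illustration; quote=False leaves quote
    characters alone, which is enough outside attribute values.
    """
    return html.escape(text, quote=False)


escaped = escape_text("a < b")
print(escaped)                 # a &lt; b
print(html.unescape(escaped))  # a < b
```

A document produced this way never contains a raw less-than sign in text, so the parser does not have to guess whether a `<` starts a tag.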

If anything like this does get put in, it should only be as a configurable option that is disabled by default: an XML parser should accept only strictly conforming documents by default. Adding support for ‘broken’ HTML because other (weak) parsers allow it is not a good plan, as it causes divergence from the standard.

-- 
Chris Tapp
opensource keylevel com
www.keylevel.com

----
You can tell you're getting older when your car insurance gets real cheap!