[xml] [PATCH] less-than character and HTML parser module

From: Christian Schoenebeck <schoenebeck crudebyte com>
To: xml gnome org
Subject: [xml] [PATCH] less-than character and HTML parser module
Date: Tue, 14 Apr 2015 16:24:45 +0200

On Tuesday 14 April 2015 09:31:25 Alex Bligh wrote:

> On 13 Apr 2015, at 22:43, Christian Schoenebeck <schoenebeck crudebyte com> wrote:
>
>> I just encountered an issue with stand-alone less-than characters if the
>> document is parsed by libxml2's HTML parser module. Consider you have a
>> text in your HTML document like:
>>
>>     a < b
>>
>> The less-than sign in this case is interpreted by the HTML parser module
>> as tag start, causing subsequent text (in this case "< b") to be
>> dropped.
>
> Isn't that correct? Shouldn't your document have
>
>      a &lt; b


If it were a well-formed HTML document, then yes. But as I said, in reality
there are plenty of HTML documents that contain text with raw less-than
characters, which is encouraged by the fact that all major browsers handle
them. libxml2's HTML parser is still an exception here.

Attached is a patch that suggests a fix for this issue.
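
For illustration only (this is not the attached patch and not libxml2 code):
a minimal sketch of the kind of lenient rule browsers apply, assuming a '<'
only opens markup when it is followed by an ASCII letter, '/', '!' or '?'.
The function names lt_starts_markup and count_literal_lt are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libxml2: returns 1 if the '<' at p
 * plausibly opens markup, 0 if it should be kept as literal text. */
int lt_starts_markup(const char *p)
{
    assert(p != NULL && *p == '<');
    char next = p[1];

    /* Start tag: '<' immediately followed by an ASCII letter. */
    if ((next >= 'a' && next <= 'z') || (next >= 'A' && next <= 'Z'))
        return 1;
    /* End tag, comment or doctype, processing instruction. */
    if (next == '/' || next == '!' || next == '?')
        return 1;
    /* Anything else (space, digit, end of input): literal '<'. */
    return 0;
}

/* Counts the '<' characters in s that the rule above keeps as text. */
size_t count_literal_lt(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (*s == '<' && !lt_starts_markup(s))
            n++;
    }
    return n;
}
```

With this rule the '<' in "a < b" stays text because a space follows it.
Note that "a <b" without the space would still be taken as a tag start, so
the rule only helps with stand-alone less-than characters.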

Best regards,
Christian Schoenebeck

Attachment: libxml2-less-than-char.patch
Description: Text Data

Follow-Ups:
- Re: [xml] [PATCH] less-than character and HTML parser module
  - From: Chris Tapp

References:
- [xml] less-than character and HTML parser module
  - From: Christian Schoenebeck
- Re: [xml] less-than character and HTML parser module
  - From: Alex Bligh
