Re: [xml] [PATCH] less-than character and HTML parser module

From: Chris Tapp <opensource keylevel com>
To: Christian Schoenebeck <schoenebeck crudebyte com>
Cc: xml gnome org
Subject: Re: [xml] [PATCH] less-than character and HTML parser module
Date: Tue, 14 Apr 2015 16:50:51 +0100

On 14 Apr 2015, at 15:24, Christian Schoenebeck <schoenebeck crudebyte com> wrote:

On Tuesday 14 April 2015 09:31:25 Alex Bligh wrote:

On 13 Apr 2015, at 22:43, Christian Schoenebeck <schoenebeck crudebyte com> wrote:

I just encountered an issue with stand-alone less-than characters when a
document is parsed by libxml2's HTML parser module. Suppose your HTML
document contains text like:

    a < b

The less-than sign in this case is interpreted by the HTML parser module
as a tag start, causing the subsequent text (in this case "< b") to be
dropped.


Isn't that correct? Shouldn't your document have

    a &lt; b
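
When the text is generated by a program, the escaped form above can be produced mechanically before it reaches any parser. A minimal sketch in Python, assuming the standard library's html.escape is acceptable for the job (the variable names are illustrative only):

```python
from html import escape

raw_text = "a < b"

# escape() replaces "&", "<" and ">" with character references
# (and, by default, quote characters as well).
safe_text = escape(raw_text)

print(safe_text)  # a &lt; b
```

This only helps for content you produce yourself; it does nothing for existing documents that already contain a raw "<".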


If it were a well-formed HTML document, then yes. But as I said, in reality
there are plenty of HTML documents that contain text with raw less-than
characters, encouraged by the fact that all major HTML browsers can handle
them. libxml2's HTML parser is an exception here.

Attached is a patch suggesting a fix for this issue.


If anything like this does go in, it should be a configurable option that is
disabled by default: an XML parser should accept only strictly conforming
documents by default. Adding support for ‘broken’ HTML because other (weak)
parsers allow it is not a good plan, as it causes divergence from the
standard.
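
To make the "opt-in, off by default" idea concrete, here is a rough sketch in Python. It is not libxml2's API and not the patch under discussion; the helper name prepare_text, the lenient flag, and the rule for what counts as a stray "<" are all assumptions made for illustration.

```python
import re

# Assumption for this sketch: a "<" that is not followed by a letter,
# "/", "!" or "?" cannot start a tag and is treated as a stray character.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def prepare_text(markup: str, lenient: bool = False) -> str:
    """Hypothetical helper: reject stray '<' unless lenient mode is requested."""
    if _STRAY_LT.search(markup) is None:
        return markup  # nothing questionable, pass through unchanged
    if not lenient:
        # Strict behaviour is the default: report the problem to the caller.
        raise ValueError("stray '<' in input; escape it as '&lt;'")
    # Lenient behaviour must be asked for explicitly.
    return _STRAY_LT.sub("&lt;", markup)


print(prepare_text("a < b", lenient=True))
```

With the default arguments, prepare_text("a < b") raises ValueError, so callers only get the tolerant behaviour when they request it. The rule is deliberately crude: something like "a <b" would still be taken as a tag start.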

--

Chris Tapp
opensource keylevel com
www.keylevel.com

----
You can tell you're getting older when your car insurance gets real cheap!


