[xml] less-than character and HTML parser module
- From: Christian Schoenebeck <schoenebeck crudebyte com>
- To: xml gnome org
- Subject: [xml] less-than character and HTML parser module
- Date: Mon, 13 Apr 2015 23:43:51 +0200
Hi!
I just encountered an issue with stand-alone less-than characters when a
document is parsed by libxml2's HTML parser module. Suppose your HTML
document contains text like:
a < b
The HTML parser module interprets the less-than sign here as the start of a
tag, causing the subsequent text (in this case "< b") to be dropped. Raw
less-than signs like this are not well-formed HTML; in practice, however,
they often occur in the text sections of HTML files, and browsers cope with
them.
If this is welcome, I would provide a patch to address the issue. My
suggestion: if the character following the less-than character is one of
(' ' | \n | \r | \t | 0 | '='), the token is interpreted as text, not as an
element. Relevant code section: HTMLparser.c -> htmlParseContent().
Another option would be to restore the original read position if
htmlParseHTMLName() fails. Currently the parser drops the entire supposed
element. Relevant code section: HTMLparser.c -> htmlParseStartTag().
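The save-and-restore idea can be sketched on a toy scanner. This is only an
illustration: parse_name() and try_start_tag() are hypothetical stand-ins
and do not reflect how htmlParseHTMLName() or htmlParseStartTag() manage the
input buffer in libxml2.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Toy stand-in for a tag-name parser (hypothetical, not libxml2).
 * Consumes [A-Za-z][A-Za-z0-9]* at *cur and returns the name length,
 * or 0 without consuming anything if no name starts there. */
size_t parse_name(const char **cur)
{
    const char *p = *cur;

    if (!isalpha((unsigned char)*p))
        return 0;
    while (isalnum((unsigned char)*p))
        p++;

    size_t len = (size_t)(p - *cur);
    *cur = p;
    return len;
}

/* *cur must point at '<'. On success, advances *cur past "<name" and
 * returns 1. On failure, restores *cur to the '<' and returns 0, so the
 * caller can emit the '<' as text instead of dropping it. */
int try_start_tag(const char **cur)
{
    assert(cur != NULL && *cur != NULL && **cur == '<');

    const char *saved = *cur;   /* remember the original read position */
    (*cur)++;                   /* skip '<' */

    if (parse_name(cur) == 0) {
        *cur = saved;           /* roll back: this was not a tag */
        return 0;
    }
    return 1;
}
```

For the input "< b", try_start_tag() returns 0 and leaves the read position
on the "<", so nothing is lost; for "<div>" it returns 1 with the position
on the ">".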
Best regards,
Christian Schoenebeck