[xml] less-than character and HTML parser module


I just encountered an issue with stand-alone less-than characters when a 
document is parsed by libxml2's HTML parser module. Suppose your HTML 
document contains text like:

        a < b

In this case the HTML parser module interprets the less-than sign as the 
start of a tag, causing the subsequent text (here "< b") to be dropped. Raw 
less-than signs like this are not well-formed HTML; in practice, however, 
they often occur in the text sections of HTML files, and browsers cope with 
them.

If this is acceptable, I would provide a patch to address the issue. My 
suggestion: if the character immediately following the less-than character 
is one of (' ' | \n | \r | \t | 0 | '='), the token is interpreted as text, 
not as an element. Relevant code section:  HTMLparser.c -> htmlParseContent()
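
To make the proposed check concrete, here is a minimal sketch of the 
lookahead test as a stand-alone C function. The name ltLooksLikeText() is 
hypothetical and is not part of libxml2, and the bare 0 in the list above is 
assumed to mean the NUL terminator (end of input right after the '<'). It 
only illustrates the decision, not how it would be wired into 
htmlParseContent().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a libxml2 function): decide whether the '<'
 * at cur[0] should be kept as literal text instead of opening a tag.
 * cur must point at a '<' inside a NUL-terminated buffer.
 * Returns 1 for "treat as text", 0 for "treat as markup".
 */
int
ltLooksLikeText(const char *cur)
{
    assert(cur != NULL && cur[0] == '<');

    switch (cur[1]) {
    case ' ':
    case '\n':
    case '\r':
    case '\t':
    case '\0':  /* assumption: the "0" in the list means end of input */
    case '=':   /* covers comparisons written as "<=" */
        return 1;
    default:
        return 0;
    }
}
```

With this sketch, ltLooksLikeText("< b") and ltLooksLikeText("<= 3") report 
text, while ltLooksLikeText("<b>") and ltLooksLikeText("</p>") are still 
handled as markup.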

Another option would be to restore the original read position if 
htmlParseHTMLName() fails. Currently the entire supposed element is dropped. 
Relevant code section:  HTMLparser.c -> htmlParseStartTag().
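
The save-and-restore idea can be sketched on a toy cursor as follows. 
ToyInput, toyParseName() and toyParseStartTag() are hypothetical names 
invented for this illustration; libxml2's real parser context and name 
parsing are more involved, so this shows only the rewind pattern and is not 
a patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Toy stand-in for the parser input; the real libxml2 context differs. */
typedef struct {
    const char *cur;    /* current read position in a NUL-terminated buffer */
} ToyInput;

/*
 * Consume a simplified tag name ([A-Za-z][A-Za-z0-9]*).
 * Returns the number of characters consumed, or 0 if no name starts here.
 */
size_t
toyParseName(ToyInput *in)
{
    const char *start = in->cur;

    if (!isalpha((unsigned char) *in->cur))
        return 0;
    while (isalnum((unsigned char) *in->cur))
        in->cur++;
    return (size_t) (in->cur - start);
}

/*
 * Try to parse "<name". On failure the read position is rewound to the
 * '<', so the caller can emit it as ordinary text instead of dropping it.
 * Returns 1 on success, 0 on failure.
 */
int
toyParseStartTag(ToyInput *in)
{
    const char *saved;

    assert(in != NULL && in->cur != NULL);
    saved = in->cur;            /* remember where the '<' was */

    if (*in->cur != '<')
        return 0;
    in->cur++;                  /* skip '<' */

    if (toyParseName(in) == 0) {
        in->cur = saved;        /* recover the original read position */
        return 0;
    }
    return 1;
}
```

For the input "< b" the function returns 0 and leaves the cursor on the '<', 
whereas for "<b>x" it returns 1 with the cursor just past the name. The 
trade-off compared with the lookahead check is that this variant needs no 
fixed character list, but the caller must then handle the rewound '<' as 
text.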

Best regards,
Christian Schoenebeck
