On Thursday 16 April 2015 13:59:28 Christian Schoenebeck wrote:
> On Thursday 16 April 2015 10:32:32 you wrote:
> > > There you go; you find the updated patch attached. It now requires HTML_PARSE_RECOVER option to be set for recovering from stand-alone less-than characters.
> > That sounds fine *except* it doesn't raise an error. The parser knows it's a broken construct that must be pointed out.
> Ok, I see what I can do about that. ;)
Even two patches this time. The 1st patch (libxml2-less-than-char_v3.patch) just addresses the last minor issue you raised regarding the missing error message. However, you may want to skip that patch and look at the second patch instead.
> > It sounds a bit weird to handle that error case as one of the main content cases, I would still be tempted to go into htmlParseStartTag, get the error reported, but push corrective data instead in recover mode.
> My initial thought was to enter htmlParseElement() like before, and in case htmlParseElement() encounters an error, it would handle the chunk as text instead (if the recover option is on). That would probably come closest to what most browsers seem to do. But the problem: that would
The 2nd patch (libxml2-invalid-tag-as-text.patch) uses that more general approach to resolve the overall issue. That is, instead of looking ahead at the content and trying to guess whether a less-than character will result in a valid tag, this 2nd patch uses the regular element parsing code, and if that fails to parse the tag start, it returns a special return value which causes the next input to be consumed as text instead. Most notably, this solution has the advantage that many more malformed cases are consumed as text (if the recovery option is on). For example, this 2nd patch also allows the parser to consume this:

a << b

The 1st patch would still have failed in this case.

Please review this 2nd patch carefully, though. That patch rewinds the parser input, and since I am not very familiar with the libxml2 internals, I am not sure whether my rewinding code a) is safe and b) actually works with all kinds of input stream types supported by the libxml2 API.

Best regards,
Christian Schoenebeck
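
As a rough illustration of the fallback described above, here is a minimal, self-contained sketch in C. It is not the attached patch and does not use any libxml2 code: the names tokenize(), parse_tag(), TAG_OK and TAG_AS_TEXT are hypothetical, the "tag" grammar is reduced to <name> and </name>, and the input is a plain in-memory string, so the open question about rewinding other input stream types does not arise here. It only shows the control flow: try the regular tag parser first, and if that fails, report the error, leave the input position where it was, and consume the '<' as text when recovery is enabled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical return codes, invented for this sketch. */
enum { TAG_OK = 0, TAG_AS_TEXT = 1 };

/* Append at most n bytes of s to out, keeping out NUL-terminated. */
static void append(char *out, size_t cap, const char *s, size_t n)
{
    size_t len = strlen(out);
    while (n-- > 0 && len + 1 < cap)
        out[len++] = *s++;
    out[len] = '\0';
}

/* Try to parse "<name>" or "</name>" starting at in[*pos].
 * On success: append "[name]" or "[/name]" and advance *pos past '>'.
 * On failure: leave *pos untouched (the "rewind") and return TAG_AS_TEXT. */
static int parse_tag(const char *in, size_t *pos, char *out, size_t cap)
{
    size_t start = *pos + 1;   /* first character after '<' */
    size_t p = start;          /* work on a copy, so failure needs no undo */

    if (in[p] == '/')
        p++;
    if (!isalpha((unsigned char)in[p]))
        return TAG_AS_TEXT;
    while (isalnum((unsigned char)in[p]))
        p++;
    if (in[p] != '>')
        return TAG_AS_TEXT;

    append(out, cap, "[", 1);
    append(out, cap, in + start, p - start);
    append(out, cap, "]", 1);
    *pos = p + 1;
    return TAG_OK;
}

/* Tokenize in into out. Returns the number of errors reported,
 * or -1 if a broken tag start is found and recover is 0. */
int tokenize(const char *in, int recover, char *out, size_t cap)
{
    size_t pos = 0;
    int errors = 0;

    assert(in != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    while (in[pos] != '\0') {
        if (in[pos] == '<' && parse_tag(in, &pos, out, cap) == TAG_OK)
            continue;
        if (in[pos] == '<') {
            errors++;          /* the broken construct is still reported */
            if (!recover)
                return -1;
        }
        append(out, cap, in + pos, 1);   /* consume one character as text */
        pos++;
    }
    return errors;
}
```

With recovery enabled, the input a << b passes through unchanged as text while both stray less-than characters are still counted as errors; with recovery disabled, the first one aborts parsing.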
Attachment:
libxml2-less-than-char_v3.patch
Description: Text Data
Attachment:
libxml2-invalid-tag-as-text.patch
Description: Text Data