Re: [xml] [PATCH] less-than character and HTML parser module



On Sunday 26 April 2015 03:24:35 Christian Schoenebeck wrote:
The 2nd patch (libxml-invalid-tag-as-text.patch) takes a more general
approach to this issue. Instead of looking ahead at the content and trying
to guess whether a less-than character will yield a valid tag, it uses the
regular element parse code; if that fails to parse the tag start, it
returns a special value which causes the next input to be consumed as text
instead. The main advantage of this solution is that many more malformed
cases are consumed as text (if the recovery option is on). For example,
the 2nd patch also accepts this as text:

      a << b

The 1st patch would still have failed in this case.
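
To illustrate the fallback idea described above, here is a minimal sketch. It
is written in Python rather than libxml2's C, it is not taken from either
patch, and the names parse_tag and tokenize are hypothetical. It assumes the
whole input is available as one string, so "rewinding" is simply a matter of
not advancing the position.

```python
def parse_tag(src, pos):
    """Try to parse a start or end tag beginning at src[pos] == '<'.

    Returns (kind, name, end_pos) on success, or None on failure. None plays
    the role of the "special return value" that tells the caller to treat
    the input as text instead.
    """
    i = pos + 1  # skip the '<'
    kind = "start"
    if i < len(src) and src[i] == "/":
        kind = "end"
        i += 1
    name_start = i
    while i < len(src) and src[i].isalnum():
        i += 1
    # A tag name must be non-empty and begin with a letter.
    if i == name_start or not src[name_start].isalpha():
        return None
    name = src[name_start:i]
    # Skip attributes up to '>'; a stray '<' or end of input means failure.
    while i < len(src) and src[i] not in "<>":
        i += 1
    if i >= len(src) or src[i] != ">":
        return None
    return kind, name, i + 1


def tokenize(src):
    """Split src into ("start"|"end", name) and ("text", data) tokens."""
    tokens, text, pos = [], [], 0
    while pos < len(src):
        result = parse_tag(src, pos) if src[pos] == "<" else None
        if result is None:
            # Not a tag, or the tag parse failed. pos was never advanced by
            # the failed attempt, so consume one character as text.
            text.append(src[pos])
            pos += 1
            continue
        if text:
            tokens.append(("text", "".join(text)))
            text = []
        kind, name, pos = result
        tokens.append((kind, name))
    if text:
        tokens.append(("text", "".join(text)))
    return tokens
```

Under these assumptions, tokenize("a << b") yields a single text token. With
a real streaming input buffer the saved position also has to remain valid
after the failed parse attempt, which the string-based sketch sidesteps.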

Please review this 2nd patch carefully, though. Because the patch rewinds
the parser input, and since I am not very familiar with the libxml2
internals, I am not sure whether my rewinding code a) is safe and
b) actually works with all kinds of input stream types supported by the
libxml2 API.

After feeding this solution a number of larger HTML files, it does indeed
seem that the rewind code in the 2nd patch is incorrect: with some of the
larger HTML files, sizable portions of the input were dropped.

Maybe somebody with better knowledge of the libxml internals could comment
on this.

Best regards,
Christian Schoenebeck

