Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

I also built lxml 4.2.5 with pristine libxml2 2.9.8 (using a variation of the above command), and got the same results. So I don't think it's a distro-specific problem.

I tried to reproduce it with only xmllint as you suggest, but I'm not having much luck. It produces correct results with "--html --debug bad.html", "--html --debug --stream bad.html", "--html --debug --push bad.html", and "--html --debug --sax bad.html".

Maybe I'm just not using the right flags - I don't know if lxml uses SAX mode, or streaming, etc. But at this point I wouldn't be too surprised if it depended on the size of some internal input buffer that's different in lxml vs xmllint. I'd welcome any advice about what else I should try, or how I can find out what calls are being made from lxml to libxml2.
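
One way to probe the buffer-size idea from the Python side might be to feed the file to lxml's HTML parser in chunks of varying size and see whether the serialized tree changes. The following is only a rough sketch: the helper names (iter_chunks, parse_in_chunks, divergent_sizes) are made up for illustration, and it is an assumption that lxml's feed interface reaches the same libxml2 code path as the original reproduction. The baseline parse could itself be the wrong one, so a mismatch only shows that chunking matters, not which result is correct.

```python
def iter_chunks(data: bytes, size: int):
    """Yield successive slices of `data`, each at most `size` bytes long."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def parse_in_chunks(data: bytes, size: int) -> bytes:
    """Feed `data` to lxml's HTML parser `size` bytes at a time.

    Returns the serialized tree so that results can be compared byte for byte.
    """
    from lxml import etree  # third-party; imported lazily on first use

    parser = etree.HTMLParser()
    for chunk in iter_chunks(data, size):
        parser.feed(chunk)
    root = parser.close()
    return etree.tostring(root, method="html")


def divergent_sizes(data: bytes, sizes) -> list:
    """Return the chunk sizes whose output differs from a one-shot parse.

    The one-shot parse is only a reference point; it is not known to be correct.
    """
    from lxml import etree

    baseline = etree.tostring(
        etree.fromstring(data, etree.HTMLParser()), method="html"
    )
    return [size for size in sizes if parse_in_chunks(data, size) != baseline]
```

With the contents of bad.html read in binary mode into `data`, something like `divergent_sizes(data, range(1, 8193))` would list the chunk sizes that produce a different tree. The upper bound of 8192 is an arbitrary choice, not a known libxml2 buffer size.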

Other than that: It's not ideal, but could you please check if you can also reproduce the bug with the first set of commands I posted? Just to verify it's not just me.


On Tue, Jan 22, 2019 at 5:11 PM Nick Wellnhofer <wellnhofer aevum de> wrote:
On 22/01/2019 15:43, Tomi Belan via xml wrote:
> After a lot of debugging, I determined the problem is in libxml2 and not the
> other libraries in my stack, and that it only seems to happen on version
> 2.9.8. But I don't see any related changes in news.html for 2.9.9, nor in the
> diff between them, so I am still worried: I don't know if the bug is really
> fixed, or just dormant. I hope you can find the root cause, and maybe add a
> regression test if you do.

I also don't see any directly related changes in either 2.9.8 or 2.9.9.

> This will download
> the manylinux binary build of lxml 4.2.5, which is statically linked to
> libxml2 2.9.8.

Are you sure that a pristine 2.9.8 build was used? Maybe there are additional
patches added by a distro?

> I couldn't shorten the file very much, because if I delete even a single
> character, the bug stops triggering. (Could it be some buffer boundary issue?)

Yes, a buffer boundary issue seems likely.

> I also built my own lxml 4.2.5 with libxml2 2.9.9 and it was not affected. So
> I believe this is a bug in libxml2 2.9.8 specifically, and not in a particular
> version of lxml.

Did you also try your own build with the official libxml2 2.9.8 sources?

> I hope you can solve the mystery. Please let me know if I can be of any help.

It would help if you could reproduce the issue with xmllint and no Python code
involved. git-bisect might also be useful.

