Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8
- From: Nick Wellnhofer <wellnhofer aevum de>
- To: Tomi Belan <tomi belan gmail com>
- Cc: xml gnome org
- Subject: Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8
- Date: Tue, 22 Jan 2019 19:56:52 +0100
On 22/01/2019 19:11, Tomi Belan wrote:
> I tried to reproduce it with only xmllint as you suggest, but I'm not having
> much luck. It produces correct results with "--html --debug bad.html", "--html
> --debug --stream bad.html", "--html --debug --push bad.html", and "--html
> --debug --sax bad.html".
> Maybe I'm just not using the right flags - I don't know if lxml uses SAX mode,
> or streaming, etc. But at this point I wouldn't be too surprised if it
> depended on the size of some internal input buffer that's different in lxml vs
> xmllint. I'd welcome any advice about what else I should try, or how can I
> find out what calls are being made from lxml to libxml2.
From a quick look at the lxml source, it seems that the `feed` method of
HTMLParser calls htmlParseChunk, so you should pass `--html --push` to
xmllint. But if it's a buffer boundary issue, you might have to recreate the
exact chunk sizes to reproduce the problem. lxml seems to split into chunks of
size INT_MAX, meaning a single chunk in most cases. xmllint first passes a
chunk of 4 bytes, then splits the remaining data into chunks of 4096 bytes.
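
To make the difference in chunking concrete, here is a rough Python sketch of the two splitting schemes described above. The helper names are made up for illustration, and the 4 and 4096 byte sizes are simply the figures mentioned in this message, passed in as defaults rather than read from the xmllint source.

```python
def xmllint_push_chunks(data, first=4, rest=4096):
    """Split bytes as described for `xmllint --push`: one small
    initial chunk, then fixed-size chunks for the remaining data.
    The default sizes are assumptions taken from the text above."""
    chunks = []
    if data:
        chunks.append(data[:first])
    for start in range(first, len(data), rest):
        chunks.append(data[start:start + rest])
    return chunks


def single_chunk(data):
    """Split bytes as lxml appears to in most cases: everything at once."""
    return [data] if data else []
```

For a 10240-byte input the first helper yields chunks of 4, 4096, 4096 and 2044 bytes, so every chunk boundary falls at a different offset than in the single-chunk case.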
But maybe I'm missing something. To be sure, you could run your Python code
under a debugger like gdb and set a breakpoint on htmlParseChunk. Also break
on htmlCtxtUseOptions to see exactly which parser options are used.
You could also start experimenting with feeding chunks of different sizes in
your Python script or with a small C program that calls htmlParseChunk in the
same way as lxml, presumably writing a single chunk. You could also try to add
4 bytes somewhere at the beginning of `bad.html` and see if it helps with
reproducing the issue using xmllint.
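
The chunk-size experiment could look roughly like the following Python sketch. It assumes lxml's feed interface (`HTMLParser.feed` and `close`) and compares each chunked parse against a single-chunk parse. The function names and the default sizes are hypothetical, chosen only for illustration.

```python
def fixed_chunks(data, size):
    """Split data into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def parse_chunked(make_parser, chunks):
    """Feed each chunk to a fresh push parser and return its result."""
    parser = make_parser()
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()


def divergent_sizes(data, sizes, make_parser, serialize):
    """Return the chunk sizes whose output differs from a single-chunk parse."""
    reference = serialize(parse_chunked(make_parser, [data]))
    return [
        size for size in sizes
        if serialize(parse_chunked(make_parser, fixed_chunks(data, size))) != reference
    ]


def check_file_with_lxml(path, sizes=(1, 4, 100, 4096)):
    """Run the comparison with lxml (third-party, imported lazily).
    The default sizes are arbitrary starting points, not known triggers."""
    from lxml import etree

    with open(path, "rb") as f:
        data = f.read()
    return divergent_sizes(data, sizes, etree.HTMLParser, etree.tostring)
```

Calling `check_file_with_lxml("bad.html")` would return the chunk sizes, if any, that serialize differently from the single-chunk parse. An empty list means none of the tried sizes changed the output, which does not rule out other sizes.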
> Other than that: It's not ideal, but could you please check if you can also
> reproduce the bug with the first set of commands I posted? Just to verify it's
> not just me.
Yes, I can try.
Nick