Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8



Thanks, that's very useful!

With a dynamically linked build of lxml, I used "ltrace" to see the calls to libxml2. It looks like you're correct: there is only one call to htmlParseChunk with the whole content (followed by a zero-length call to terminate the input). But even so, I still wasn't able to reproduce it in pure C. Could it be because xmllint reads ctxt->myDoc, while lxml uses SAX2 event handlers (according to parsertarget.pxi)? AFAICT xmllint's --push and --sax options are incompatible.
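To watch the event stream rather than the finished tree, a parser target can record what the parser reports. This is only a rough sketch: the class name EventRecorder and the helper unclosed_count are made up for illustration, and the method names (start, end, data, close) follow lxml's documented parser target interface as I understand it.

```python
class EventRecorder:
    """Hypothetical parser target that records start/end/data events."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        # Called when an element is opened.
        self.events.append(("start", tag))

    def end(self, tag):
        # Called when an element is closed.
        self.events.append(("end", tag))

    def data(self, text):
        # Called for text content.
        self.events.append(("data", text))

    def close(self):
        # Called when parsing finishes; the return value is handed
        # back to whoever closed the parser.
        return self.events


def unclosed_count(events, tag):
    """Return how many 'start' events for `tag` have no matching 'end'."""
    starts = sum(1 for kind, value in events if kind == "start" and value == tag)
    ends = sum(1 for kind, value in events if kind == "end" and value == tag)
    return starts - ends
```

Assuming lxml is installed, something like etree.HTMLParser(target=EventRecorder()) followed by parser.feed(data) and parser.close() should return the recorded list, and unclosed_count(events, "script") would then show whether a script element was left open.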

I had more luck with git bisect. Using a dynamically linked build of lxml and pointing LD_LIBRARY_PATH to libxml2/.libs/, I found that the bug was:
- introduced by https://github.com/GNOME/libxml2/commit/6e6ae5daa6cd9640c9a83c1070896273e9b30d14
- fixed(?) by https://github.com/GNOME/libxml2/commit/7a1bd7f6497ac33a9023d556f6f47a48f01deac0

I hope that's meaningful to you, because I have no idea what those commits are doing or how they could be related to this bug... The commits sound related to character encoding, but bad.html is plain ASCII...

Tomi

On Tue, Jan 22, 2019 at 7:56 PM Nick Wellnhofer <wellnhofer aevum de> wrote:
On 22/01/2019 19:11, Tomi Belan wrote:
> I tried to reproduce it with only xmllint as you suggest, but I'm not having
> much luck. It produces correct results with "--html --debug bad.html", "--html
> --debug --stream bad.html", "--html --debug --push bad.html", and "--html
> --debug --sax bad.html".
>
> Maybe I'm just not using the right flags - I don't know if lxml uses SAX mode,
> or streaming, etc. But at this point I wouldn't be too surprised if it
> depended on the size of some internal input buffer that's different in lxml vs
> xmllint. I'd welcome any advice about what else I should try, or how I can
> find out what calls are being made from lxml to libxml2.

From a quick look at the lxml source, it seems that the `feed` method of
HTMLParser calls htmlParseChunk, so you should pass `--html --push` to
xmllint. But if it's a buffer boundary issue, you might have to recreate the
exact chunk sizes to reproduce the problem. lxml seems to split into chunks of
size INT_MAX, meaning a single chunk in most cases. xmllint first passes a
chunk of 4 bytes, then splits the remaining data into chunks of 4096 bytes.
But maybe I'm missing something. To be sure, you could run your Python code
under a debugger like gdb and set a breakpoint on htmlParseChunk. Also break
on htmlCtxtUseOptions to see exactly which parser options are used.

You could also start experimenting with feeding chunks of different sizes in
your Python script or with a small C program that calls htmlParseChunk in the
same way as lxml, presumably writing a single chunk. You could also try to add
4 bytes somewhere at the beginning of `bad.html` and see if it helps with
reproducing the issue using xmllint.
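
A minimal sketch of the chunk-size experiment, for illustration only. The function names xmllint_style_chunks and feed_in_chunks are hypothetical; the defaults (4 bytes first, then 4096-byte pieces) mirror the xmllint behaviour described above. The parser argument is assumed to be any object with feed() and close() methods, such as an lxml HTMLParser. Whether each feed() call reaches htmlParseChunk with exactly these boundaries is an assumption that should be checked with ltrace or gdb.

```python
from typing import Iterable, Iterator


def xmllint_style_chunks(data: bytes, first: int = 4, rest: int = 4096) -> Iterator[bytes]:
    """Yield the first `first` bytes, then the remainder in `rest`-byte pieces."""
    if not data:
        return
    yield data[:first]
    for offset in range(first, len(data), rest):
        yield data[offset:offset + rest]


def feed_in_chunks(parser, chunks: Iterable[bytes]):
    """Feed every chunk to `parser`, then close it and return the result."""
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()
```

For example, feed_in_chunks(parser, xmllint_style_chunks(data)) imitates xmllint's splitting, while feed_in_chunks(parser, [data]) imitates the single large chunk that lxml appears to send. Varying `first` and `rest` is one way to probe for a buffer boundary problem.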

> Other than that: It's not ideal, but could you please check if you can also
> reproduce the bug with the first set of commands I posted? Just to verify it's
> not just me.

Yes, I can try.

Nick

