Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8



On 22/01/2019 19:11, Tomi Belan wrote:
I tried to reproduce it with only xmllint as you suggest, but I'm not having much luck. It produces correct results with "--html --debug bad.html", "--html --debug --stream bad.html", "--html --debug --push bad.html", and "--html --debug --sax bad.html".

Maybe I'm just not using the right flags - I don't know if lxml uses SAX mode, or streaming, etc. But at this point I wouldn't be too surprised if it depended on the size of some internal input buffer that's different in lxml vs xmllint. I'd welcome any advice about what else I should try, or how can I find out what calls are being made from lxml to libxml2.

From a quick look at the lxml source, it seems that the `feed` method of HTMLParser calls htmlParseChunk, so you should pass `--html --push` to xmllint. But if it's a buffer boundary issue, you might have to recreate the exact chunk sizes to reproduce the problem. lxml seems to split the input into chunks of size INT_MAX, which means a single chunk in most cases. xmllint first passes a chunk of 4 bytes, then splits the remaining data into chunks of 4096 bytes. But maybe I'm missing something. To be sure, you could run your Python code under a debugger like gdb and set a breakpoint on htmlParseChunk. Also break on htmlCtxtUseOptions to see exactly which parser options are used.

You could also start experimenting with feeding chunks of different sizes in your Python script or with a small C program that calls htmlParseChunk in the same way as lxml, presumably writing a single chunk. You could also try to add 4 bytes somewhere at the beginning of `bad.html` and see if it helps with reproducing the issue using xmllint.
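
For the Python variant, something along these lines might be a starting point. It is only a rough sketch: the names `split_chunks` and `parse_in_chunks` are made up for illustration, and the default sizes simply mirror the xmllint behaviour described above (4 bytes first, then 4096-byte chunks), not anything checked against the xmllint source. It feeds the same file to lxml's `HTMLParser.feed` with various chunk sizes and reports which ones serialize differently from a single-chunk parse.

```python
import sys


def split_chunks(data, first=4, size=4096):
    """Split data into one initial chunk of `first` bytes followed by
    chunks of `size` bytes. first=0 skips the initial short chunk."""
    chunks = [data[:first]] if first > 0 else []
    rest = data[first:] if first > 0 else data
    chunks.extend(rest[i:i + size] for i in range(0, len(rest), size))
    # Drop empty chunks so the parser is never fed a zero-length buffer.
    return [c for c in chunks if c]


def parse_in_chunks(data, first=4, size=4096):
    """Feed data to lxml's push parser chunk by chunk and return the
    serialized result."""
    # Imported here so the chunking helper works without lxml installed.
    from lxml import etree

    parser = etree.HTMLParser()
    for chunk in split_chunks(data, first, size):
        parser.feed(chunk)
    root = parser.close()
    return etree.tostring(root)


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as f:
        data = f.read()

    # Reference: everything in one chunk, which is what lxml appears to do.
    reference = parse_in_chunks(data, first=0, size=len(data) or 1)

    for size in (1, 2, 3, 4, 16, 100, 1024, 4096):
        for first in (0, 4):
            if parse_in_chunks(data, first, size) != reference:
                print(f"output differs: first={first} size={size}")
```

Run it as `python3 script.py bad.html`. If any combination prints a difference, that would point towards a chunk boundary problem; if nothing is printed, the chunk size is probably not the deciding factor.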

Other than that: It's not ideal, but could you please check if you can also reproduce the bug with the first set of commands I posted? Just to verify it's not just me.

Yes, I can try.

Nick

