Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8



On Wed, Jan 23, 2019 at 12:55 PM Nick Wellnhofer <wellnhofer aevum de> wrote:
> The commit obviously also affected documents that didn't need encoding
> conversion. I didn't realize that.

Aha! I noticed that the chromium link you sent mentions a >32KB string that gets converted to a >64KB string, which sounded suspiciously similar. It looks like lxml's feed() function [1] is doing the same thing. I don't know much about Python's C API, but [2] and [3] suggest that lxml is using a deprecated macro and giving libxml2 a multibyte buffer even though the input would fit into pure ASCII. This explains why it behaved differently from xmllint.

[1] https://github.com/lxml/lxml/blob/master/src/lxml/parser.pxi#L1242
[2] https://stackoverflow.com/questions/26079392/how-is-unicode-represented-internally-in-python
[3] https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_AS_DATA
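
To illustrate the size doubling, here is a small Python sketch. It only measures how large the same ASCII-only text becomes under fixed-width encodings; the helper name buffer_sizes() is made up for this example, and whether lxml ends up with 2 or 4 bytes per character is an assumption that depends on the Python build.

```python
import sys


def buffer_sizes(text: str) -> dict:
    """Return the byte length of `text` under a few encodings.

    Hypothetical helper for illustration only; it does not call lxml
    or libxml2.
    """
    # Use the native byte order so no BOM is prepended to the result.
    suffix = "le" if sys.byteorder == "little" else "be"
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-" + suffix)),
        "utf-32": len(text.encode("utf-32-" + suffix)),
    }


# A pure-ASCII document slightly larger than 32 KB (32768 bytes).
html = "<script>" + "x" * 33_000 + "</script>"
sizes = buffer_sizes(html)
print(sizes)
```

With 2 bytes per character, a string just over 32KB lands just over 64KB, which matches the sizes mentioned in the chromium report.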

I also noticed that feed() is doing something special with the first 4 bytes, giving them to _htmlCtxtResetPush() instead of htmlParseChunk(). So the discussion about buffer boundaries might be slightly incorrect.
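
A rough sketch of what that split does to the boundaries, in plain Python. The function split_feed() and the 64KB chunk size are assumptions for illustration; I have not checked how lxml or libxml2 actually slice the remaining data, only that the first 4 bytes are handled separately.

```python
def split_feed(data: bytes, prefix_len: int = 4, chunk_size: int = 65536):
    """Yield (offset, piece) pairs: a short prefix first, then fixed-size chunks.

    Hypothetical model of a feed() that treats the first few bytes
    specially. The chunk size is an assumed value.
    """
    head, rest = data[:prefix_len], data[prefix_len:]
    if head:
        yield 0, head
    for start in range(0, len(rest), chunk_size):
        # Offsets are relative to the original input, so every chunk
        # boundary is shifted by prefix_len compared to naive chunking.
        yield prefix_len + start, rest[start:start + chunk_size]


data = b"a" * 70_000
pieces = list(split_feed(data))
offsets = [offset for offset, _ in pieces]
print(offsets)
```

Under these assumptions the second boundary falls at 65540 rather than 65536, so any reasoning about which byte sits at a buffer boundary would be off by 4.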

> At least we know that the issue is isolated
> to 2.9.8. Thanks for your efforts!

Yes, thank you. Now it's clear that my immediate issue is solved and version 2.9.9 works. So I probably won't look into this much further.

I guess it's up to you to decide what to do next, and if any libxml2 changes are needed. It would be good to add some tests to decrease the likelihood that this issue or something similar happens again. For that, you might still need to isolate the root cause further, and create a pure C test case. (Maybe based on a test case from chromium instead of mine.) But of course it's up to you to determine the priority of that. Thanks again for your help, and good luck if you decide to continue.
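
For what it's worth, the shape of such a test could look like the sketch below. Note that it uses Python's standard html.parser as a stand-in, not libxml2, so it does not reproduce the bug; the document sizes and chunk sizes are guesses around the 32KB/64KB marks. A C version would presumably drive htmlParseChunk() with the same inputs and check the same property.

```python
from html.parser import HTMLParser


class ScriptTagCounter(HTMLParser):
    """Counts <script> start and end tags seen by the parser."""

    def __init__(self):
        super().__init__()
        self.opened = 0
        self.closed = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.opened += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self.closed += 1


def make_document(script_len: int) -> str:
    """Build an ASCII-only HTML document with one script of the given length."""
    return ("<html><head><script>" + "x" * script_len +
            "</script></head><body><p>ok</p></body></html>")


def count_script_tags(document: str, chunk_size: int):
    """Feed the document in fixed-size chunks and return (opened, closed)."""
    parser = ScriptTagCounter()
    for start in range(0, len(document), chunk_size):
        parser.feed(document[start:start + chunk_size])
    parser.close()
    return parser.opened, parser.closed


# Every combination should see exactly one opening and one closing tag,
# no matter where the chunk boundaries fall.
results = {
    (script_len, chunk_size): count_script_tags(make_document(script_len), chunk_size)
    for script_len in (100, 33_000, 70_000)
    for chunk_size in (1000, 4096, 32768, 65536)
}
for key, counts in sorted(results.items()):
    print(key, counts)
```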

Tomi


