Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8



On 23/01/2019 16:14, Tomi Belan wrote:
> I don't know too much about Python's C API, but [2] [3] suggest lxml is using a deprecated macro and giving libxml2 a multibyte buffer even though the input would fit into pure ASCII. This explains why it behaved differently than xmllint.

Right, if Python passes ASCII codes as, say, 16-bit integers, libxml2 will detect this as UTF-16 and convert the encoding behind the scenes. I'm not sure what would happen with an encoding that isn't Unicode-compatible. Maybe there's a bug lurking in lxml.
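
To illustrate the point about 16-bit integers, here is a minimal Python sketch using only the standard library. It does not call lxml or libxml2; the helper name as_16bit_units and the sample markup are made up for illustration. It only shows that ASCII text stored as native-endian 16-bit code units is byte-for-byte identical to UTF-16 in the platform's byte order, which is presumably why a parser handed such a buffer ends up on its UTF-16 conversion path rather than treating the input as plain ASCII.

```python
import sys


def as_16bit_units(text: str) -> bytes:
    """Store each character as one native-endian 16-bit integer.

    Hypothetical helper for illustration; only valid for code points
    up to U+FFFF (plain ASCII in this example).
    """
    return b"".join(ord(ch).to_bytes(2, sys.byteorder) for ch in text)


# Sample markup invented for this sketch, not taken from the bug report.
html = "<script>x</script>"
wide = as_16bit_units(html)

# Name of the UTF-16 variant matching this machine's byte order.
native_utf16 = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"

# The 16-bit buffer is exactly the UTF-16 encoding of the same text,
# and twice as long as the ASCII encoding.
assert wide == html.encode(native_utf16)
assert len(wide) == 2 * len(html.encode("ascii"))
```

Every second byte of such a buffer is zero, so the parser sees different input bytes than xmllint would when reading the same document from an ASCII file.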

> It would be good to add some tests to decrease the likelihood that this issue or something similar happens again.

Yes, that would be nice. But it was only a short-lived regression that I personally don't want to spend more time on. A UTF-16 test case derived from either your bug report or the Chromium one would probably make the most sense.

Nick

