[xml] Character reference encoding is slow



Hi,

We got a report on the lxml list where someone tried to parse and
serialise a file that contains 8,000,000 non-ASCII character references
(‡), as in

    "<text>" + "&#135;" * 8000000 + "</text>"

Parsing this is pretty fast, so that's not the problem, but serialising
this document back to a "US-ASCII" encoding, i.e. re-encoding the
non-ASCII characters as character references, is slow as hell. The user
stopped the run after 12 hours at 100% CPU load. I tried this with xmllint
and you can literally wait for each byte that arrives in the target file.
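
To make the re-encoding step concrete, here is a minimal Python sketch of
what the serialiser has to produce for an ASCII target. It relies on
Python's built-in "xmlcharrefreplace" error handler purely as an
illustration; it is not libxml2's code path, and the function name
reencode_as_charrefs is made up for this example.

```python
def reencode_as_charrefs(text):
    """Encode text as ASCII bytes, replacing every character that ASCII
    cannot represent with a decimal XML character reference.

    Illustration only: this uses Python's codec machinery, not libxml2.
    Note that '<', '>' and '&' are NOT escaped here; a real serialiser
    handles those separately.
    """
    return text.encode("ascii", "xmlcharrefreplace")


if __name__ == "__main__":
    # U+0087 is the character that "&#135;" refers to.
    sample = "\u0087" * 3
    print(reencode_as_charrefs(sample))
```

Each character is handled independently of the others, so one would expect
the cost to grow roughly linearly with the number of references.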

Is there any reason why this is so, or does anyone have any insight into
what the problem might be here? This definitely sounds like a bug to me.

Stefan



