[xml] Character reference encoding is slow

From: "Stefan Behnel" <stefan_ml behnel de>
To: xml gnome org
Subject: [xml] Character reference encoding is slow
Date: Fri, 29 Aug 2008 09:37:41 +0200 (CEST)

Hi,

we got a report on the lxml list where someone tried to parse and
serialise a file that contains 8,000,000 non-ASCII character references
(&#135;), as in

    "<text>" + "&#135;" * 8000000 + "</text>"

Parsing this is pretty fast, so that's not the problem, but serialising
this document back to a "US-ASCII" encoding, i.e. re-encoding the
non-ASCII characters as character references, is slow as hell. The user
stopped the run after 12 hours at 100% CPU load. I tried this with xmllint
and you can literally wait for each byte that arrives in the target file.
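
For reference, here is a minimal sketch of the round trip in question. It
assumes the standard-library ElementTree rather than lxml/libxml2, so it only
illustrates the operation (parse, then serialise to US-ASCII with character
references) and is not expected to reproduce the slowdown. The function name
roundtrip and the small repeat count are made up for illustration.

```python
import xml.etree.ElementTree as ET


def roundtrip(n):
    """Parse a document holding n '&#135;' references and serialise it
    back to US-ASCII (hypothetical helper, not part of lxml or libxml2)."""
    doc = "<text>" + "&#135;" * n + "</text>"
    # Parsing resolves each reference to the single character U+0087.
    root = ET.fromstring(doc)
    # U+0087 cannot be represented in US-ASCII, so the serialiser has to
    # write each one back out as a numeric character reference.
    return ET.tostring(root, encoding="us-ascii")


if __name__ == "__main__":
    out = roundtrip(1000)  # far smaller than the 8,000,000 in the report
    print(len(out), out[:24])
```

Swapping in lxml.etree and the original count would be the way to check the
behaviour actually reported.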

Is there a known reason for this, or does anyone have any insight into what
the problem might be here? This definitely looks like a bug to me.

Stefan

Follow-Ups:
- Re: [xml] Character reference encoding is slow
  - From: Daniel Veillard
