Re: [xml] Character reference encoding is slow
- From: Daniel Veillard <veillard redhat com>
- To: Stefan Behnel <stefan_ml behnel de>
- Cc: xml gnome org
- Subject: Re: [xml] Character reference encoding is slow
- Date: Fri, 29 Aug 2008 11:51:54 +0200
On Fri, Aug 29, 2008 at 09:37:41AM +0200, Stefan Behnel wrote:
> Hi,
> we got a report on the lxml list where someone tried to parse and
> serialise a file that contains 8,000,000 non-ASCII character references
> (‡), as in
> "<text>" + "‡" * 8000000 + "</text>"
> Parsing this is pretty fast, so that's not the problem, but serialising
> this document back to a "US-ASCII" encoding, i.e. re-encoding the
> non-ASCII characters as character references, is slow as hell. The user
> stopped the run after 12 hours at 100% CPU load. I tried this with xmllint
> and you can literally wait for each byte that arrives in the target file.
> Is there any reason why this is so, or does anyone have any insight into
> what the problem may be here? This definitely sounds like a bug to me.
Well, that's a horribly crappy XML document.
I assume the output buffer grows linearly, so you end up realloc'ing all
the time and hit quadratic behaviour as a result; the reallocation of
the buffer should probably double the size at each step instead. Plus
the escaping is done each time the ASCII encoder stops on a character
it cannot encode.
If I have 2 minutes I will try to look at this today, before the 2.7.0
release.
BTW, if people have a bit of time, checking the latest SVN version for
sanity should help.
Daniel
--
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
daniel veillard com | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/
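
The reallocation argument in the reply can be illustrated with a small
simulation. This is only a sketch, not libxml2's buffer code: the function
names, the 4096-byte growth step and the worst-case assumption that every
reallocation copies the whole buffer are all made up for illustration. It
counts how many bytes get copied while appending one escaped character
reference at a time, once with fixed-step growth and once with doubling.

```python
# Hypothetical model of an output buffer that grows on demand.
# Assumption: every reallocation copies all bytes currently in the buffer.

# Size of one escaped reference as Python's own ASCII encoder writes it.
CHUNK = len("‡".encode("ascii", "xmlcharrefreplace"))


def linear_growth(capacity, need, step=4096):
    """Grow by a fixed step (assumed value) until `need` bytes fit."""
    while capacity < need:
        capacity += step
    return capacity


def doubling_growth(capacity, need):
    """Double the capacity until `need` bytes fit."""
    capacity = max(capacity, 64)
    while capacity < need:
        capacity *= 2
    return capacity


def bytes_copied(n_refs, grow):
    """Total bytes moved by reallocations while appending n_refs chunks."""
    capacity = used = copied = 0
    for _ in range(n_refs):
        need = used + CHUNK
        if need > capacity:
            capacity = grow(capacity, need)
            copied += used  # existing content is moved to the new block
        used = need
    return copied


if __name__ == "__main__":
    for n in (100_000, 200_000, 400_000):
        print(n, bytes_copied(n, linear_growth), bytes_copied(n, doubling_growth))
```

Under these assumptions, doubling the number of references roughly
quadruples the bytes copied with fixed-step growth, while with doubling the
copied bytes stay below twice the final output size.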