Re: [xml] Character reference encoding is slow

From: Daniel Veillard <veillard redhat com>
To: Stefan Behnel <stefan_ml behnel de>
Cc: xml gnome org
Subject: Re: [xml] Character reference encoding is slow
Date: Fri, 29 Aug 2008 11:51:54 +0200

On Fri, Aug 29, 2008 at 09:37:41AM +0200, Stefan Behnel wrote:

Hi,

we got a report on the lxml list where someone tried to parse and
serialise a file that contains 8,000,000 non-ASCII character references
(&#135;), as in

    "<text>" + "&#135;" * 8000000 + "</text>"

Parsing this is pretty fast, so that's not the problem, but serialising
this document back to a "US-ASCII" encoding, i.e. re-encoding the
non-ASCII characters as character references, is slow as hell. The user
stopped the run after 12 hours at 100% CPU load. I tried this with xmllint
and you can literally wait for each byte that arrives in the target file.

Is there any reason why this is so, or does anyone have any insight into
what the problem may be here? This definitely sounds like a bug to me.


  Well, that's a horribly crappy XML document.
I assume the output buffer grows linearly, so you end up realloc'ing all
the time and hit quadratic behaviour as a result. The reallocation of
the buffer should probably double the size at each step instead. On top
of that, the escaping is done each time the ASCII encoder stops on a
character it cannot encode.
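
To illustrate the suspected difference between the two growth policies,
here is a minimal sketch. It is not libxml2 code: sketch_buf,
sketch_buf_append and sketch_copy_cost are hypothetical names made up
for this example, and the "copied" counter assumes the worst case where
every realloc has to move the existing bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer; NOT libxml2's actual buffer type. */
typedef struct {
    char  *data;
    size_t len;     /* bytes in use */
    size_t cap;     /* bytes allocated */
    size_t copied;  /* bytes a worst-case realloc would have moved */
} sketch_buf;

enum growth { GROW_LINEAR, GROW_DOUBLING };

/* Append n bytes, growing the allocation according to `mode`.
 * Returns 0 on success, -1 on allocation failure. */
static int sketch_buf_append(sketch_buf *b, const char *src, size_t n,
                             enum growth mode)
{
    assert(b != NULL && src != NULL);
    if (b->len + n > b->cap) {
        size_t need = b->len + n;
        size_t new_cap;
        if (mode == GROW_LINEAR) {
            /* Grow by just enough: one realloc per append. */
            new_cap = need;
        } else {
            /* Double until the request fits: few reallocs overall. */
            new_cap = b->cap ? b->cap : 64;
            while (new_cap < need)
                new_cap *= 2;
        }
        char *p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;
        b->copied += b->len;  /* assume the old contents were moved */
        b->data = p;
        b->cap = new_cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}

/* Append `count` copies of "&#135;" (6 bytes each) and return the
 * total number of bytes moved by reallocations in the worst case. */
static size_t sketch_copy_cost(size_t count, enum growth mode)
{
    sketch_buf b = {0};
    for (size_t i = 0; i < count; i++) {
        if (sketch_buf_append(&b, "&#135;", 6, mode) != 0)
            break;
    }
    size_t cost = b.copied;
    free(b.data);
    return cost;
}
```

Under these assumptions, the linear policy moves a number of bytes that
grows with the square of the number of references, while the doubling
policy stays below twice the final output size. Real allocators can
often extend a block in place, so the actual slowdown may be smaller
than this worst case.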

If I have a couple of minutes I will try to look at this today, before
the 2.7.0 release.

  BTW, if people have a bit of time, checking the latest SVN version for
sanity would help.

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
