Re: [xml] Character reference encoding is slow



On Fri, Aug 29, 2008 at 09:37:41AM +0200, Stefan Behnel wrote:
Hi,

we got a report on the lxml list where someone tried to parse and
serialise a file that contains 8,000,000 non-ASCII character references
(‡), as in

    "<text>" + "&#135;" * 8000000 + "</text>"

Parsing this is pretty fast, so that's not the problem, but serialising
this document back to a "US-ASCII" encoding, i.e. re-encoding the
non-ASCII characters as character references, is slow as hell. The user
stopped the run after 12 hours at 100% CPU load. I tried this with xmllint
and you can literally wait for each byte that arrives in the target file.
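
For reference, the round trip described above can be sketched at a smaller scale. This is only an illustration and rests on an assumption: it uses Python's standard-library ElementTree rather than lxml or xmllint, so it is not expected to show the slowdown. The helper name roundtrip_ascii and the reduced count are made up for the example.

```python
import xml.etree.ElementTree as ET


def roundtrip_ascii(n):
    """Parse <text> holding n copies of &#135; and serialise it as US-ASCII.

    Hypothetical helper for illustration; uses the stdlib parser,
    not libxml2, so timings say nothing about the reported bug.
    """
    source = "<text>" + "&#135;" * n + "</text>"
    root = ET.fromstring(source)  # the references become real characters
    # Serialising to US-ASCII forces every non-ASCII character
    # back into a numeric character reference.
    return ET.tostring(root, encoding="us-ascii")


out = roundtrip_ascii(1000)
```

The serialised bytes contain only ASCII, with each character written out again as a reference.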

Is there any reason why this is so, or does anyone have any insight into
what the problem may be here? This definitely sounds like a bug to me.

  Well that's a horribly crappy XML document.
I assume the output buffer grows linearly, so you end up realloc'ing all
the time and hit quadratic behaviour as a result. The reallocation of the
buffer should probably use a doubling-at-each-step algorithm. Plus, the
escaping is done each time the ASCII encoder stops.
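
A rough way to see the difference between the two growth strategies is to count how many bytes get copied by reallocations. The following is a simulation only, not libxml2's buffer code: the function bytes_copied, the initial capacity of 16 and the fixed increment of 16 are all assumptions picked for illustration, and it models the worst case where every realloc copies the whole buffer.

```python
def bytes_copied(total, grow):
    """Count bytes copied by reallocations while appending `total` bytes.

    `grow(capacity)` returns the new capacity once the buffer is full.
    Assumes each reallocation copies the entire current content.
    """
    capacity, used, copied = 16, 0, 0
    for _ in range(total):
        if used == capacity:
            copied += used            # realloc moves the existing content
            capacity = grow(capacity)
        used += 1
    return copied


# Fixed-size increments: copy cost grows quadratically with output size.
linear = bytes_copied(100_000, lambda c: c + 16)
# Doubling: total copy cost stays proportional to the output size.
doubling = bytes_copied(100_000, lambda c: c * 2)
```

With doubling, the total copied stays below twice the final size, while fixed increments copy orders of magnitude more for the same output.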

If I have 2 minutes I will try to look at this today, before the 2.7.0 release.

  BTW if people have a bit of time, checking the latest SVN version for
sanity should help.

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/


