Re: [xml] Character reference encoding is slow

On Fri, Aug 29, 2008 at 11:51:54AM +0200, Daniel Veillard wrote:
On Fri, Aug 29, 2008 at 09:37:41AM +0200, Stefan Behnel wrote:

we got a report on the lxml list where someone tried to parse and
serialise a file that contains 8,000,000 non-ASCII character references
(‡), as in

    "<text>" + "&#135;" * 8000000 + "</text>"

Parsing this is pretty fast, so that's not the problem, but serialising
this document back to a "US-ASCII" encoding, i.e. re-encoding the
non-ASCII characters as character references, is slow as hell. The user
stopped the run after 12 hours at 100% CPU load. I tried this with xmllint
and you can literally wait for each byte that arrives in the target file.

Is there any reason why this is so, or does anyone have any insights what
the problem may be here? This definitely sounds like a bug to me.

  Well that's an horribly crappy XML document.
I assume the output buffer grows lineary, so you end up realloc'ing all
the time and hit a quadratic behaviour as a result, somehow the
reallocation of the buffer size should probably use a doubling at each
step algorithm. Plus the escaping is done while the ASCII encoder stops.

  I looked at it, and sorry the test is so broken i can't fix the
behaviour at least not in a simple and fair way. I will explain:

  You have 1 document.
  It contains 16 megabytes of UTF characters.
  You ask libxml2 to convert this on output to US-ASCII.

The way libxml2 does when asked to convert data to a given encoding is
to use iconv or the set of internal conversion routines. It will pass
the data by big chunks (as large as reasonable possible to speed up the
convertion process) to the encoder.
Libxml2 has no apriori knowledge of what an encoding may support in term
of range of character, and people can plug their own encoding handlers
which may be different from what libxml2 would have done using the 
internal handlers. So basically libxml2 has absolutely no way when
passing a character to the output converter to know if the caracter will
be accepted ... or not.
So libxml2 relies on the encoder to tell him if the character could not
be converted, that's what it does in xmlCharEncOutFunc() in encoding.c
around line 1950, it feeds the converter, which may accept to convert
only a subset of the string. In that case the set written is pushed to
the output buffer, and the return value -2 allow to find that there is
a character from the UTF-8 input buffer which could not be converted.
Libxml2 then removes it from the input buffer which was shrinked to that
location, and repushes it at the beginning, and retry the conversion.
In your pathological case *none* of the character actually pass the
converter. So the exception mechanism has to be used for each of the
8 million characters, each time shrinking and growing an input buffer
and retrying again, to fail converting etc...
The CPU is loosing all its time doing memory copies on the shrinking
and growing of the buffers and of course the loop repeating tries to
convert characters and failing, then the sprintf.
The conversion to character references is a fallback mechanism designed
for exceptional cases, this exemple pushes it to the limit with a single
huge input text block and none of the characters in it passing though
the encoder.

  I could try to fix your specific case by making the internal ASCII
converter of libxml2 automatically accept out of ASCII range characters
and convert them to character references in the output. That would be
absolute cheating, if for some reason iconv ascii converter was used
instead of the internal one this would fail in the same way.

  Anyway to make a long story short, I spent a few hours today
fixing the problem by adding support for a new kind of buffers avoiding
most of the memmoves needed when handling an encoding conversion
exception like those. Bill reported that your test now parses and
save in more or less the same time i.e. 7secs on one of his boxes.
Grab 2.7.0 !


Daniel Veillard      | libxml Gnome XML XSLT toolkit
daniel veillard com  | Rpmfind RPM search engine | virtualization library

[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]