Re: [xml] Speed up patches for large attributes



On Fri, Jul 10, 2009 at 11:04:38AM +0200, Diego Santa Cruz wrote:
Hi,

I sent a couple of patches to the mailing list last week about speeding up
the parsing of very large attributes when using the reader API, although I
think they did not go through the mailing list. I would like to get some
feedback and contribute these to libxml2, so here they are again; sorry if
you receive this twice.

To summarize, the problem we are hitting is that xmlTextReaderPushData()
parses in chunks of 512 bytes, but xmlParseChunk() waits until it has a
complete tag before proceeding with parsing. This is normally fine, but when
an attribute is very large (e.g., 600 K, with inline image data in data
hrefs), it ends up with O(n^2) complexity, as each call to xmlParseChunk()
rescans the complete accumulated buffer looking for the start and end of the
tag. For instance, on an ARM9 at 300 MHz it takes 29 s to parse such a node
with a 600 K attribute.
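
To make the scenario concrete, here is a minimal reader-API loop of the kind
we use (the file name is just a placeholder); each xmlTextReaderRead()
internally pushes 512-byte chunks until the element's '>' has been seen,
which is where the rescanning hits:

/* Minimal sketch (hypothetical file name): streaming a document that
 * contains one element with a very large attribute through the reader
 * API. */
#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/xmlreader.h>

int main(void) {
    xmlTextReaderPtr reader = xmlReaderForFile("big-attr.xml", NULL, 0);
    if (reader == NULL) {
        fprintf(stderr, "failed to open big-attr.xml\n");
        return 1;
    }
    int ret;
    while ((ret = xmlTextReaderRead(reader)) == 1) {
        /* node-by-node processing would go here */
    }
    xmlFreeTextReader(reader);
    xmlCleanupParser();
    return (ret != 0) ? 1 : 0;   /* ret is 0 at end of document, -1 on error */
}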

The first patch below (patch #1) takes the simple approach of doubling the
chunk size each time a call to xmlParseChunk() does not parse anything, and
resetting it back to 512 bytes whenever something is parsed. It takes the
parsing time down from 29 s to 490 ms for our test file. The problem I see
with this approach is that we may end up parsing a lot of stuff in a single
call to xmlParseChunk(), leading to a very large tree; for instance, if a
514 K buffer is required to hold the complete tag, we end up passing a
1024 K chunk to the parser.
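
As an illustration of the idea only (not the actual patch, which lives inside
xmlTextReaderPushData()), a standalone sketch of the doubling heuristic on
top of the public push API could look like this; the file name, the 1 MB cap
and the detection of parser progress via a SAX callback flag are all
assumptions made for the example:

/* Hedged sketch of the idea behind patch #1, outside of libxml2 itself:
 * double the chunk size whenever xmlParseChunk() produced no SAX events
 * (i.e. parsed nothing), and drop back to 512 bytes as soon as something
 * was parsed. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libxml/parser.h>

static int made_progress = 0;

static void on_start_element(void *ctx, const xmlChar *name,
                             const xmlChar **attrs) {
    (void)ctx; (void)name; (void)attrs;
    made_progress = 1;   /* any SAX event counts as parser progress */
}

int main(void) {
    FILE *f = fopen("big-attr.xml", "rb");
    if (f == NULL) return 1;

    xmlSAXHandler sax;
    memset(&sax, 0, sizeof(sax));
    sax.startElement = on_start_element;

    xmlParserCtxtPtr ctxt =
        xmlCreatePushParserCtxt(&sax, NULL, NULL, 0, "big-attr.xml");
    if (ctxt == NULL) { fclose(f); return 1; }

    size_t chunk_size = 512;            /* start with the usual 512 bytes */
    char *buf = malloc(1 << 20);        /* 1 MB cap on the chunk size */
    if (buf == NULL) { xmlFreeParserCtxt(ctxt); fclose(f); return 1; }

    size_t n;
    while ((n = fread(buf, 1, chunk_size, f)) > 0) {
        made_progress = 0;
        xmlParseChunk(ctxt, buf, (int)n, 0);
        if (made_progress)
            chunk_size = 512;           /* something parsed: reset */
        else if (chunk_size < (1 << 20))
            chunk_size *= 2;            /* nothing parsed: try a bigger chunk */
    }
    xmlParseChunk(ctxt, NULL, 0, 1);    /* terminate */

    xmlFreeParserCtxt(ctxt);
    free(buf);
    fclose(f);
    xmlCleanupParser();
    return 0;
}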

The second thing I have tried (patch #2 below) is to start looking for a
potential end of tag ('>') when a call to xmlParseChunk() does not parse
anything. The loop in xmlTextReaderPushData() then does not call
xmlParseChunk() until a '>' is found; when it is found, xmlParseChunk() is
called with the chunk size extended to just past the '>'. With this the
parsing time is 450 ms, which is only a bit better than the first approach
but avoids passing needlessly large chunks to xmlParseChunk().
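
Again purely as an illustration of the idea, and not the patch itself, the
same '>' scan can be sketched on top of the public push API; the file name
and buffer sizes are placeholders, and the real patch does the equivalent
scan inside the xmlTextReaderPushData() loop:

/* Hedged sketch of the idea behind patch #2: accumulate raw input locally
 * and only hand data to xmlParseChunk() once a '>' has been seen, cutting
 * the chunk right after that '>', so the parser does not repeatedly rescan
 * a partial tag. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libxml/parser.h>

int main(void) {
    FILE *f = fopen("big-attr.xml", "rb");
    if (f == NULL) return 1;

    xmlParserCtxtPtr ctxt =
        xmlCreatePushParserCtxt(NULL, NULL, NULL, 0, "big-attr.xml");
    if (ctxt == NULL) { fclose(f); return 1; }

    size_t cap = 1 << 20, len = 0;
    char *pend = malloc(cap);           /* data not yet given to the parser */
    if (pend == NULL) { xmlFreeParserCtxt(ctxt); fclose(f); return 1; }

    char in[512];
    size_t n;
    while ((n = fread(in, 1, sizeof(in), f)) > 0) {
        if (len + n > cap) {
            char *tmp = realloc(pend, cap * 2);
            if (tmp == NULL) break;     /* give up on allocation failure */
            pend = tmp;
            cap *= 2;
        }
        memcpy(pend + len, in, n);
        len += n;

        /* Only push once a potential end of tag is pending. */
        char *gt = memchr(pend, '>', len);
        if (gt != NULL) {
            size_t upto = (size_t)(gt - pend) + 1;   /* include the '>' */
            xmlParseChunk(ctxt, pend, (int)upto, 0);
            memmove(pend, pend + upto, len - upto);
            len -= upto;
        }
    }
    if (len > 0)
        xmlParseChunk(ctxt, pend, (int)len, 0);      /* flush the remainder */
    xmlParseChunk(ctxt, NULL, 0, 1);                 /* terminate */

    xmlFreeDoc(ctxt->myDoc);            /* default SAX handler built a tree */
    xmlFreeParserCtxt(ctxt);
    free(pend);
    fclose(f);
    xmlCleanupParser();
    return 0;
}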

I am very new to libxml2 internals, so I would really appreciate some
feedback. Are these acceptable ways to speed up parsing of this kind of XML
document? Are there any adverse side effects of doing this?

  I saw your patches, on bugzilla too, but I haven't had any time yet to
check them; I hope to do that in the near future. The pre-scanning
in xmlParseChunk() can be tricky, so this really needs some serious
attention. As a first test, make sure that "make check" passes okay
on a git checkout with the patch applied; that will help build
trust :-)

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/


