Re: [xml] Scalability problems with the reader API

From: "Diego Santa Cruz" <Diego SantaCruz spinetix com>
To: "Diego Santa Cruz" <Diego SantaCruz spinetix com>, <xml gnome org>
Subject: Re: [xml] Scalability problems with the reader API
Date: Mon, 6 Jul 2009 15:54:03 +0200

-----Original Message-----
From: xml-bounces gnome org [mailto:xml-bounces gnome org] On Behalf Of
Diego Santa Cruz
Sent: 03 July 2009 14:47
To: xml gnome org
Subject: [xml] Scalability problems with the reader API

[snip]


Would you think doubling the chunk size fed to xmlParseChunk() on each
iteration of the while loop in xmlTextReaderPushData() be a sane
approach to
lowering the complexity of parsing such documents ?


I implemented this idea (see attached xml-perf.patch) and it indeed speeds up
the processing of the docs with very large attributes. In our embedded
platform (an ARM9 at 300 MHz) the time to read the node with a 600 KB
attribute goes down from 29s to about 490 ms, on a recent PC it goes down
from 480 ms to 14 ms.

The problem I see with doubling the chunk size is that we may end parsing a
lot of stuff into the tree if the chunk size was just not enough to hold the
complete tag but after doubling it will read a zillion of normally sized tags
(e.g., the attribute is 514 KB and the chunk size ends up being 1024 KB).

So I have tried with a more sophisticated strategy (see attached
xml-perf2.patch) where if a call to xmlParseChunk() does not parse anything
we starting looking for a '>' and refrain from proceeding to xmlParseChunk()
until one is found or the eof is reached. Once the '>' is found only the
buffer up to there is passed xmlParseChunk() (unless it is less than 512
bytes, in which case we pass 512).

The '>' is not always the end of tag due to unescaped '>' in attributes but
it should catch most cases. Since the scanning for '>' is done after
xmlParserInputBufferRead() on only the newly read data, and after
xmlParseChunk() on only the non yet parsed data it should make best use of
CPU caches as processing is memory local.

I tested this second strategy and the parsing time for the trouble node is 6%
better on our embedded platform than simple chunk size doubling, which is not
that much of a gain but it should also avoid the ill cases mentioned above.

Does this look like a good solution to the problem or does anyone see some
bad side effect?

I am not very familiar with the internals of libxml2 so any feedback is
appreciated.

Thanks,

Diego

--
Diego Santa Cruz, PhD
Technology Architect
_________________________________
SpinetiX S.A.
Rue des Terreaux 17
1003, Lausanne, Switzerland
T +41 21 341 15 50
F +41 21 311 19 56
diego santacruz spinetix com
http://www.spinetix.com
http://www.youtube.com/SpinetiXTeam
_________________________________

Attachment: xml-perf.patch
Description: xml-perf.patch

Attachment: xml-perf2.patch
Description: xml-perf2.patch

References:
- [xml] Scalability problems with the reader API
  - From: Diego Santa Cruz

[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]