-----Original Message----- From: xml-bounces gnome org [mailto:xml-bounces gnome org] On Behalf Of Diego Santa Cruz Sent: 03 July 2009 14:47 To: xml gnome org Subject: [xml] Scalability problems with the reader API
[snip]
Would you think doubling the chunk size fed to xmlParseChunk() on each iteration of the while loop in xmlTextReaderPushData() be a sane approach to lowering the complexity of parsing such documents ?
I implemented this idea (see attached xml-perf.patch) and it indeed speeds up the processing of the docs with very large attributes. In our embedded platform (an ARM9 at 300 MHz) the time to read the node with a 600 KB attribute goes down from 29s to about 490 ms, on a recent PC it goes down from 480 ms to 14 ms. The problem I see with doubling the chunk size is that we may end parsing a lot of stuff into the tree if the chunk size was just not enough to hold the complete tag but after doubling it will read a zillion of normally sized tags (e.g., the attribute is 514 KB and the chunk size ends up being 1024 KB). So I have tried with a more sophisticated strategy (see attached xml-perf2.patch) where if a call to xmlParseChunk() does not parse anything we starting looking for a '>' and refrain from proceeding to xmlParseChunk() until one is found or the eof is reached. Once the '>' is found only the buffer up to there is passed xmlParseChunk() (unless it is less than 512 bytes, in which case we pass 512). The '>' is not always the end of tag due to unescaped '>' in attributes but it should catch most cases. Since the scanning for '>' is done after xmlParserInputBufferRead() on only the newly read data, and after xmlParseChunk() on only the non yet parsed data it should make best use of CPU caches as processing is memory local. I tested this second strategy and the parsing time for the trouble node is 6% better on our embedded platform than simple chunk size doubling, which is not that much of a gain but it should also avoid the ill cases mentioned above. Does this look like a good solution to the problem or does anyone see some bad side effect? I am not very familiar with the internals of libxml2 so any feedback is appreciated. Thanks, Diego -- Diego Santa Cruz, PhD Technology Architect _________________________________ SpinetiX S.A. Rue des Terreaux 17 1003, Lausanne, Switzerland T +41 21 341 15 50 F +41 21 311 19 56 diego santacruz spinetix com http://www.spinetix.com http://www.youtube.com/SpinetiXTeam _________________________________
Attachment:
xml-perf.patch
Description: xml-perf.patch
Attachment:
xml-perf2.patch
Description: xml-perf2.patch