[xml] Scalability problems with the reader API



Hi,

We are using libxml2 with good success, but we have hit a scalability problem
with the reader API when large attributes are present. When an element has a
very large attribute, the time it takes to parse that element is basically
O(n^2) in the attribute size. The typical problematic file is an SVG document
with inlined image data (e.g., a base64-encoded JPEG in a data URL inside an
href attribute), where attributes can easily reach 600 KB.

The problem appears to come from the fact that xmlTextReaderPushData() will
only feed 512 bytes (CHUNK_SIZE) to xmlParseChunk() at a time. On each call
to xmlParseChunk(), xmlParseGetLasts() is called to find the start and end of
the element, which of course cannot be found until the whole element is
loaded into the buffer (e.g., all 600 KB of it), so the loop is repeated,
growing the buffer by just 512 bytes at a time, and each time the entire
buffer is rescanned looking for the '<' and '>'. For an n-byte attribute this
rescans roughly 512 + 1024 + 1536 + ... + n ≈ n^2/1024 bytes in total, hence
the quadratic behaviour. This is already slow on a fast PC, but it becomes
awfully slow on an embedded platform.
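For reference, something like the following standalone snippet should
reproduce it (the attribute content is a dummy run of 'A's standing in for
the base64 data, and the 600 KB size is just illustrative; if the behaviour
is quadratic, doubling attr_len should roughly quadruple the parse time).
Compile with: gcc repro.c $(xml2-config --cflags --libs)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <libxml/xmlreader.h>

int main(void) {
    /* Build <svg href="AAAA..."/> with a ~600 KB attribute value. */
    const size_t attr_len = 600 * 1024;
    char *buf = malloc(attr_len + 64);
    size_t off = 0;

    off += sprintf(buf + off, "<svg href=\"");
    memset(buf + off, 'A', attr_len);   /* stand-in for base64 data */
    off += attr_len;
    off += sprintf(buf + off, "\"/>");

    clock_t t0 = clock();
    xmlTextReaderPtr reader =
        xmlReaderForMemory(buf, (int) off, NULL, NULL, 0);
    while (xmlTextReaderRead(reader) == 1)
        ;                               /* walk all nodes */
    xmlFreeTextReader(reader);
    printf("parsed %zu bytes in %.3f s\n", off,
           (double) (clock() - t0) / CLOCKS_PER_SEC);

    free(buf);
    return 0;
}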

Do you think doubling the chunk size fed to xmlParseChunk() on each
iteration of the while loop in xmlTextReaderPushData() would be a sane
approach to lowering the complexity of parsing such documents?
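To make the idea concrete, here is a rough sketch of what the feed loop
could look like with doubling. This is not the actual xmlreader.c code; the
function and variable names are made up for illustration, and the real loop
would additionally stop as soon as a complete node becomes available and
reset the chunk size to CHUNK_SIZE at that point:

#include <libxml/parser.h>

#define CHUNK_SIZE 512  /* current fixed feed size in xmlreader.c */

/* Sketch only: feed geometrically growing chunks until 'avail'
 * bytes of buffered input have been consumed. */
static int feed_growing_chunks(xmlParserCtxtPtr ctxt,
                               const char *data, size_t avail) {
    size_t cur = 0;
    size_t chunk = CHUNK_SIZE;

    while (cur < avail) {
        size_t len = avail - cur;
        if (len > chunk)
            len = chunk;
        if (xmlParseChunk(ctxt, data + cur, (int) len, 0) != 0)
            return -1;
        cur += len;
        chunk *= 2;     /* double the feed size on each iteration */
    }
    return 0;
}

With doubling, the rescanned prefixes form a geometric series
(512 + 1024 + 2048 + ... + n ≈ 2n), so the total work should drop from
O(n^2) to O(n), while small documents still get fed in 512-byte steps on
the first iterations.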

Thanks,

Diego

--
Diego Santa Cruz, PhD
Technology Architect
_________________________________
SpinetiX S.A.
Rue des Terreaux 17
1003, Lausanne, Switzerland
T +41 21 341 15 50
F +41 21 311 19 56
diego santacruz spinetix com
http://www.spinetix.com
http://www.youtube.com/SpinetiXTeam
_________________________________



