[xml] Speed up patches for large attributes



Hi,

I sent a couple of patches to the mailing list last week about speeding up
the parsing of very large attributes when using the reader API, but I think
they did not make it through to the list. I would like to get some feedback
and contribute them to libxml2, so here they are again; apologies if you
receive this twice.

To summarize: the problem we are hitting is that xmlTextReaderPushData()
parses in chunks of 512 bytes, but xmlParseChunk() waits until it has a
complete tag before proceeding. This is normally fine, but when an attribute
is very large (e.g., 600 KB, with inline image data in data hrefs), parsing
ends up with O(n^2) complexity, as each call to xmlParseChunk() rescans the
entire accumulated buffer for the start and end of the tag. For instance, on
a 300 MHz ARM9 it takes 29 s to parse such a node with a 600 KB attribute.

The first patch below (patch #1) takes the simple approach of doubling the
chunk size each time a call to xmlParseChunk() does not parse anything, and
resetting it back to 512 bytes whenever something is parsed. It brings the
parsing time down from 29 s to 490 ms for our test file. The problem I see
with this approach is that a single call to xmlParseChunk() may end up
parsing a lot of data, leading to a very large tree: for instance, if a
514 KB buffer is required to hold the complete tag, we end up passing a
1024 KB chunk to the parser.

The second thing I have tried (patch #2 below) is to start looking for a
potential end of tag ('>') when a call to xmlParseChunk() does not parse
anything. The loop in xmlTextReaderPushData() then does not call
xmlParseChunk() again until a '>' is found; when one is found, it calls
xmlParseChunk() with the chunk size set to extend just past the '>'. With
this the parsing time is 450 ms, only slightly better than the first
approach, but it avoids passing needlessly large chunks to xmlParseChunk().

I am very new to libxml2 internals, so I would really appreciate some
feedback. Are these acceptable ways to speed up the parsing of this kind of
XML document? Are there any adverse side effects of doing this?

Thanks,

Diego

Patch #1: simple doubling of chunk size
===================================================================
--- xmlreader.c (revision 7902)
+++ xmlreader.c (working copy)
@@ -809,6 +809,7 @@
     xmlBufferPtr inbuf;
     int val, s;
     xmlTextReaderState oldstate;
+    int csize = CHUNK_SIZE;
 
     if ((reader->input == NULL) || (reader->input->buffer == NULL))
        return(-1);
@@ -818,12 +819,15 @@
     inbuf = reader->input->buffer;
 
     while (reader->state == XML_TEXTREADER_NONE) {
-       if (inbuf->use < reader->cur + CHUNK_SIZE) {
+       if (inbuf->use < reader->cur + csize) {
            /*
             * Refill the buffer unless we are at the end of the stream
             */
            if (reader->mode != XML_TEXTREADER_MODE_EOF) {
-               val = xmlParserInputBufferRead(reader->input, 4096);
+               int rsize = 4096;
+               if (rsize < csize)
+                   rsize += csize;
+               val = xmlParserInputBufferRead(reader->input, rsize);
                if ((val == 0) &&
                    (inbuf->alloc == XML_BUFFER_ALLOC_IMMUTABLE)) {
                    if (inbuf->use == reader->cur) {
@@ -849,11 +853,11 @@
         * parse by block of CHUNK_SIZE bytes, various tests show that
         * it's the best tradeoff at least on a 1.2GH Duron
         */
-       if (inbuf->use >= reader->cur + CHUNK_SIZE) {
+       if (inbuf->use >= reader->cur + csize) {
            val = xmlParseChunk(reader->ctxt,
                          (const char *) &inbuf->content[reader->cur],
-                         CHUNK_SIZE, 0);
-           reader->cur += CHUNK_SIZE;
+                         csize, 0);
+           reader->cur += csize;
            if ((val != 0) || (reader->ctxt->wellFormed == 0))
                return(-1);
        } else {
@@ -866,6 +870,8 @@
                return(-1);
            break;
        }
+       /* FIXME: should use a reasonable max size to avoid overflows in ill cases */
+       csize *= 2;
     }
 
     /*

Patch #2: search for potential end of tag before parsing large tags
===================================================================
--- xmlreader.c (revision 7902)
+++ xmlreader.c (working copy)
@@ -809,6 +809,8 @@
     xmlBufferPtr inbuf;
     int val, s;
     xmlTextReaderState oldstate;
+    int csize = CHUNK_SIZE;
+    int search_tag_end;
 
     if ((reader->input == NULL) || (reader->input->buffer == NULL))
        return(-1);
@@ -816,9 +818,10 @@
     oldstate = reader->state;
     reader->state = XML_TEXTREADER_NONE;
     inbuf = reader->input->buffer;
+    search_tag_end = 0;
 
     while (reader->state == XML_TEXTREADER_NONE) {
-       if (inbuf->use < reader->cur + CHUNK_SIZE) {
+       if (inbuf->use < reader->cur + csize) {
            /*
             * Refill the buffer unless we are at the end of the stream
             */
@@ -840,8 +843,15 @@
                    /* mark the end of the stream and process the remains */
                    reader->mode = XML_TEXTREADER_MODE_EOF;
                    break;
+               } if ((val > 0) && (search_tag_end)) {
+                   const xmlChar *tmp, *end;
+                   tmp = &inbuf->content[inbuf->use-val];
+                   end = &inbuf->content[inbuf->use];
+                   while (tmp < end && *tmp != '>') tmp++;
+                   csize = tmp - &inbuf->content[reader->cur] + 1;
+                   search_tag_end = (tmp == end);
                }
-
+               continue; /* ensure we have enough data */
            } else 
                break;
        }
@@ -849,11 +859,11 @@
         * parse by block of CHUNK_SIZE bytes, various tests show that
         * it's the best tradeoff at least on a 1.2GH Duron
         */
-       if (inbuf->use >= reader->cur + CHUNK_SIZE) {
+       if (inbuf->use >= reader->cur + csize) {
            val = xmlParseChunk(reader->ctxt,
                          (const char *) &inbuf->content[reader->cur],
-                         CHUNK_SIZE, 0);
-           reader->cur += CHUNK_SIZE;
+                         csize, 0);
+           reader->cur += csize;
            if ((val != 0) || (reader->ctxt->wellFormed == 0))
                return(-1);
        } else {
@@ -866,6 +876,22 @@
                return(-1);
            break;
        }
+
+       /*
+        * If nothing parsed on first pass try to find the end of a tag
+        * and pass at least that much to next chunk parse or trigger
+        * reading of input until we find a potential tag end
+        */
+       if (reader->state == XML_TEXTREADER_NONE) {
+           const xmlChar *tmp, *end;
+           tmp = &inbuf->content[reader->cur];
+           end = &inbuf->content[inbuf->use];
+           while (tmp < end && *tmp != '>') tmp++;
+           csize = tmp - &inbuf->content[reader->cur] + 1;
+           if (csize < CHUNK_SIZE)
+               csize = CHUNK_SIZE;
+           search_tag_end = (tmp == end);
+       }
     }
 
     /*

--
Diego Santa Cruz, PhD
Technology Architect
_________________________________
SpinetiX S.A.
Rue des Terreaux 17
1003, Lausanne, Switzerland
T +41 21 341 15 50
F +41 21 311 19 56
diego santacruz spinetix com
http://www.spinetix.com
http://www.youtube.com/SpinetiXTeam
_________________________________

