Re: [xml] [PATCH] Use buffers when constructing string node lists.



On Thu, May 10, 2012 at 08:17:25PM -0700, Conrad Irwin wrote:
Hi Veillard and all,

Firstly, thanks for libxml: it's awesome!

I noticed recently that libxml was taking a surprisingly long time to perform some
operations (many minutes instead of milliseconds), and so I did some digging. It turns out
that the problem was caused by the realloc()ing done in xmlNodeAddContentLen() which can
be called many (many) times when assigning some content into a node.

For background, I'm dealing with XML that contains emails, these can have large
attachments (~6MB) which are base-64 encoded, line-wrapped at 78 chars, and each line ends
with a line break. This means that xmlNodeAddContentLen() is being called about 200,000 times,
and so there are 200,000 reallocs of a 6MB string, which takes a while... (I put a synthetic
example of this at https://gist.github.com/2656940)
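
To illustrate the cost described above, here is a minimal sketch in plain C. It is not
libxml2 code: GrowStr, growstr_append() and count_reallocs() are hypothetical names made
up for this example. It only counts how often realloc() is called when 78-byte lines are
appended one at a time, either resizing to the exact size on every call or doubling the
capacity when it runs out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string, for illustration only (not libxml2's xmlBuffer). */
typedef struct {
    char  *data;      /* NUL-terminated contents */
    size_t len;       /* bytes used, excluding the NUL */
    size_t cap;       /* bytes currently allocated */
    size_t reallocs;  /* number of realloc() calls made so far */
} GrowStr;

/* Append n bytes to s. With doubling == 0 the allocation is resized to fit
 * exactly on every call; otherwise the capacity doubles whenever it runs out.
 * Returns 0 on success, -1 if realloc() fails. */
static int growstr_append(GrowStr *s, const char *chunk, size_t n, int doubling)
{
    assert(s != NULL && chunk != NULL);
    size_t need = s->len + n + 1;
    if (!doubling || need > s->cap) {
        size_t cap = need;
        if (doubling) {
            cap = s->cap ? s->cap : 64;
            while (cap < need)
                cap *= 2;
        }
        char *p = realloc(s->data, cap);
        if (p == NULL)
            return -1;
        s->data = p;
        s->cap = cap;
        s->reallocs++;
    }
    memcpy(s->data + s->len, chunk, n);
    s->len += n;
    s->data[s->len] = '\0';
    return 0;
}

/* Append `lines` chunks of 78 bytes each and return the number of realloc()
 * calls that were needed, or (size_t)-1 on allocation failure. */
static size_t count_reallocs(size_t lines, int doubling)
{
    char line[78];
    GrowStr s = {0};
    size_t result;

    memset(line, 'A', sizeof line);
    for (size_t i = 0; i < lines; i++) {
        if (growstr_append(&s, line, sizeof line, doubling) != 0) {
            free(s.data);
            return (size_t)-1;
        }
    }
    result = s.reallocs;
    free(s.data);
    return result;
}
```

In this sketch, count_reallocs(1000, 0) reports 1000 calls while count_reallocs(1000, 1)
reports 11; for 200,000 lines the doubling variant needs 18. How expensive each call is
depends on the allocator, since realloc() may or may not have to copy the block.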

The attached patch works around that problem by using the existing buffer API to merge the
strings together before even creating the text node; this keeps the number of realloc()s
at a manageable level.

I'd love feedback on the patch, and am happy to fix problems with it, or explore other
solutions if you think that this is barking up the wrong tree :).

  Hi Conrad,

that's interesting! I was initially afraid of a sudden explosion of
memory allocations for building a tree, since by default buffers tend to
"waste" memory by using doubling allocations, but that's not the case.
  xmllint --noout doc/libxml2-api.xml
when compiled with memory debug produces

paphio:~/XML -> cat .memdump
      MEMORY ALLOCATED : 0, MAX was 12756699

and without your patch 12755657, i.e. the increase is minimal (1042 bytes).

There is also the cost of creating the buffers all the time.
I need to read the code and check, but I may be interested in a hybrid
approach where we switch to a buffer only when the text node starts to
become too big (a 4k threshold would exclude nearly all usual types of
"document" usage, i.e. everything except blocks of data).
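
One possible shape for such a hybrid, as a rough sketch only: the code below is plain C,
not libxml2 code, and HybridText, hybrid_append() and HYBRID_THRESHOLD are hypothetical
names. It assumes "4k" means 4096 bytes. Small text is resized to fit exactly, so nothing
is wasted; once the text outgrows the threshold, the capacity grows by doubling.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: the "4k" switch-over point is taken to be 4096 bytes. */
#define HYBRID_THRESHOLD 4096

/* Hypothetical text accumulator, for illustration only. */
typedef struct {
    char  *data;      /* NUL-terminated contents */
    size_t len;       /* bytes used, excluding the NUL */
    size_t cap;       /* bytes currently allocated */
    size_t reallocs;  /* number of realloc() calls made so far */
} HybridText;

/* Append n bytes to t. Below the threshold the allocation fits exactly;
 * above it the capacity doubles. Returns 0 on success, -1 on failure. */
static int hybrid_append(HybridText *t, const char *chunk, size_t n)
{
    assert(t != NULL && chunk != NULL);
    size_t need = t->len + n + 1;
    if (need > t->cap) {
        size_t cap = need;                  /* small text: exact fit */
        if (need > HYBRID_THRESHOLD) {      /* large text: geometric growth */
            cap = HYBRID_THRESHOLD;
            while (cap < need)
                cap *= 2;
        }
        char *p = realloc(t->data, cap);
        if (p == NULL)
            return -1;
        t->data = p;
        t->cap = cap;
        t->reallocs++;
    }
    memcpy(t->data + t->len, chunk, n);
    t->len += n;
    t->data[t->len] = '\0';
    return 0;
}
```

With 78-byte chunks, ten appends leave an allocation of exactly 781 bytes, while a
thousand appends end with a 131072-byte allocation after 57 calls to realloc(). The
trade-off is that a large node may hold close to twice the memory it needs.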

P.S. Should I create a bug for this too?

  Hum, yes for tracking, though I prefer to interact through the list
  :-)

   thanks !

Daniel


-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/


