Re: [xml] xmlXPathNodeSetSort performance

From: Daniel Veillard <veillard redhat com>
To: Vojtech Fried <vfried opentext com>
Cc: "xml gnome org" <xml gnome org>
Subject: Re: [xml] xmlXPathNodeSetSort performance
Date: Wed, 25 Jul 2012 17:49:36 +0800

On Wed, Jul 25, 2012 at 07:42:16AM +0000, Vojtech Fried wrote:

Hi,

  Hello,

I use libxml xpath engine on quite large (and mostly "flat") xml
files. It seems that Shellsort, that is used in xmlXPathNodeSetSort is a
performance bottleneck for my case. I have read some posts about sorting
in libxml in the libxml archive, but I agree that qsort was not the way
to go. I experimented with Timsort instead and my results were good for
me. For about 10000 nodes, my test was about 5x faster with Timsort,
for 1000 nodes about 10% faster, for small data files, the difference
was not measurable.

  Okay, out of curiosity what's the maximum amount of nodes in the
nodeset you are ever hitting ?

The patch is attached. It is not very polished, but it can be done on
demand. I took Timsort implementation from https://github.com/swenson/sort
(it has MIT Public License) The implementation uses its own
BINARY_INSERTION_SORT if the size of sorted data is less than 64. I had
to make it compile on msvc (msvc does not support C99, I had to backport
it to C89). I have also removed some unused sorting algorithms.

  Yeah we would need more polishing. I dislike the idea of having code
in a .h even though i understand that as a trial approach it's easier
for testing :-)

Timsort is able to utilize sorted runs of data (that is the reason
why Shellsort is used in libxml as far as I know). It is a widely used
algorithm, e.g. in python. The only disadvantage I know is that it
uses O(n) memory. It is not a problem for me, but I don't know about
others. If this is not acceptable, I could try to find some Smoothsort
implementation. Smoothsort has similar properties like Timsort, but uses
only O(1) memory. From what I have read, it is quite tricky to implement
correctly though.

  Err You mean that Timsort need to duplicate the nodelist somehow instead of
working directly on the given list, because overall I don't know data
structures which are O(1) (but i would have a good business model if you
got one :-) . Basically doubling the list memory requirement doesn't
sounds a big deal I agree.

Are there any performance tests for libxml I could run?

  Not really, it's used in so different ways that everyone has his own
criteria. See the dba100000.xml target in Makefile.am to generate an
arbitrary large file. But it's used only for Timingtests which only does
parsing timings, not XPath.

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

Follow-Ups:
- Re: [xml] xmlXPathNodeSetSort performance
  - From: Vojtech Fried
- Re: [xml] xmlXPathNodeSetSort performance
  - From: Vojtech Fried

References:
- [xml] xmlXPathNodeSetSort performance
  - From: Vojtech Fried

[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]