[xml] xmlXPathNodeSetSort performance



Hi,

 

I use libxml xpath engine on quite large (and mostly “flat”) xml files. It seems that Shellsort, that is used in xmlXPathNodeSetSort is a bottleneck for my case. I have read some posts about sorting in libxml in the libxml archive, but I agree that qsort was not the way to go.

I experimented with Timsort instead and my results were good for me. For about 10000 nodes, my test was about 5x faster with Timsort, for 1000 nodes about 10% faster, for small data files, the difference was not measurable.

 

The patch is attached. It is not very polished, but it can be done on demand. I took Timsort implementation from https://github.com/swenson/sort (it has MIT Public License) The implementation uses its own BINARY_INSERTION_SORT if the size of sorted data is less than 64. I had to make it compile on msvc (msvc does not support C99, I had to backport it to C89). I have also removed some unused sorting algorithms.

 

Timsort is able to utilize sorted runs of data (that is the reason why Shellsort is used in libxml as far as I know). It is a widely used algorithm, e.g. in python. The only disadvantage is that it uses O(n) memory. It is not a problem for me, but I don’t know about others. If this is not acceptable, I could try to find some Smoothsort implementation. Smoothsort has similar properties like Timsort, but uses only O(1) memory. From what I have read, it is quite tricky to implement correctly though.

 

Are there any performance tests for libxml I could run? Any comments?

 

Vojtech

Attachment: SortOptimization.patch
Description: SortOptimization.patch



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]