Hi, I use libxml xpath engine on quite large (and mostly “flat”) xml files. It seems that Shellsort, that is used in xmlXPathNodeSetSort is a bottleneck for my case. I have read some posts about sorting in libxml in the libxml
archive, but I agree that qsort was not the way to go. I experimented with Timsort instead and my results were good for me. For about 10000 nodes, my test was about 5x faster with Timsort, for 1000 nodes about 10% faster, for small data files, the difference was not measurable. The patch is attached. It is not very polished, but it can be done on demand. I took Timsort implementation from
https://github.com/swenson/sort (it has MIT Public License) The implementation uses its own BINARY_INSERTION_SORT if the size of sorted data is less than 64. I had to make it compile on msvc (msvc does not support
C99, I had to backport it to C89). I have also removed some unused sorting algorithms. Timsort is able to utilize sorted runs of data (that is the reason why Shellsort is used in libxml as far as I know). It is a widely used algorithm, e.g. in python. The only disadvantage is that it uses O(n) memory. It
is not a problem for me, but I don’t know about others. If this is not acceptable, I could try to find some Smoothsort implementation. Smoothsort has similar properties like Timsort, but uses only O(1) memory. From what I have read, it is quite tricky to implement
correctly though. Are there any performance tests for libxml I could run? Any comments? Vojtech |
Attachment:
SortOptimization.patch
Description: SortOptimization.patch