Hi, I use libxml xpath engine on quite large (and mostly “flat”) xml files. It seems that Shellsort, that is used in xmlXPathNodeSetSort is a performance bottleneck for my case. I have read some posts about sorting
in libxml in the libxml archive, but I agree that qsort was not the way to go. I experimented with Timsort instead and my results were good for me. For about 10000 nodes, my test was about 5x faster with Timsort, for 1000 nodes about 10% faster, for small
data files, the difference was not measurable. The patch is attached. It is not very polished, but it can be done on demand. I took Timsort implementation from https://github.com/swenson/sort (it has MIT Public License) The implementation uses its own BINARY_INSERTION_SORT
if the size of sorted data is less than 64. I had to make it compile on msvc (msvc does not support C99, I had to backport it to C89). I have also removed some unused sorting algorithms. Timsort is able to utilize sorted runs of data (that is the reason why Shellsort is used in libxml as far as I know). It is a widely used algorithm, e.g. in python. The only disadvantage I know is that it uses
O(n) memory. It is not a problem for me, but I don’t know about others. If this is not acceptable, I could try to find some Smoothsort implementation. Smoothsort has similar properties like Timsort, but uses only O(1) memory. From what I have read, it is quite
tricky to implement correctly though. Are there any performance tests for libxml I could run? Vojtech |
Attachment:
SortOptimization.patch
Description: SortOptimization.patch