[xml] XPath performance issues



Hi,

Almost exactly two years ago, I brought up the topic of surprisingly unpredictable XPath performance on this list (thread titled "confusing xpath performance characteristics"; it received no response at the time). The problem is not the actual search but the merging of node sets afterwards. The effect is that a detailed expression like

     /xs:schema/xs:complexType//xs:element[@name="equity"]/@type

is several orders of magnitude slower than the less constrained expression

    //xs:element[@name="equity"]/@type

The problem here is that the evaluator finds several matches for "xs:complexType", searches each subtree, and then merges the subresults, removing duplicates along the way with a quadratic algorithm. The runtime of this approach quickly explodes with the number of nodes in the node set, especially since the merge can be applied several times on the way up the expression tree.
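To make the cost concrete, here is a minimal Python sketch of the merge strategy described above (the names and the use of plain lists are my own; the real code in libxml2's xmlXPathNodeSetMergeAndClear() operates on C node-set arrays and compares node pointers):

```python
def merge_and_clear(result, subresult):
    """Model of the quadratic merge: for every node in the new
    subresult, linearly scan the accumulated result for a duplicate
    before appending ("skip duplicates"). That is O(len(result)) per
    node, i.e. O(n^2) overall as the result set grows."""
    for node in subresult:
        if node not in result:   # linear duplicate scan
            result.append(node)
    subresult.clear()            # the C code empties the source set too
    return result
```

With k subtree matches contributing n nodes in total, repeating this merge at each step of the expression tree is what makes the constrained expression so much slower than the unconstrained one.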

There are several surprising expressions where this shows, e.g.

    descendant-or-self::dl/descendant::dt

is very fast, whereas the semantically only slightly different

    descendant-or-self::dl/descendant-or-self::*/dt

is already quite a bit slower, but the equivalent

    descendant-or-self::dl//dt

is orders of magnitude slower. You can test them against the 4.7 MB HTML5 spec page at

http://www.w3.org/TR/html5/Overview.html

The last expression takes literally hours, whereas the first two finish within seconds. I ran this under callgrind; the top function, accounting for 99.4% of the overall runtime, is xmlXPathNodeSetMergeAndClear(), specifically the inner loop starting with the comment "skip duplicates".

There are two issues here. One is that the duplicate removal uses the easiest-to-implement, and unfortunately also slowest, algorithm. This is understandable, because doing better is not trivial in C. However, the algorithm could be improved quite substantially, e.g. by using merge sort (even on an arbitrary sort criterion such as the node address). If eventual sorting in document order is required, a merge sort is the best choice anyway, as it can be applied recursively along the XPath evaluation tree, thus yielding sorted node sets at every stage.
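A sketch of that merge-sort idea, again in Python for brevity (the function name and the use of a key function are my own; in C the key would be the document position or, as a fallback, the node address): merging two already-sorted node lists while dropping duplicates is linear rather than quadratic.

```python
def merge_sorted_unique(a, b, key=id):
    """Merge two node lists that are each already sorted by `key`
    (e.g. document order, or the node address as an arbitrary but
    consistent criterion), dropping duplicates on the way.
    Runs in O(len(a) + len(b)) instead of O(n^2), and the output is
    again sorted, so it can be reused at the next merge step."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        ka, kb = key(a[i]), key(b[j])
        if ka < kb:
            out.append(a[i]); i += 1
        elif kb < ka:
            out.append(b[j]); j += 1
        else:                      # same node in both sets: keep one copy
            out.append(a[i]); i += 1; j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out
```

Because the output is sorted by the same key, the merges compose: applying this recursively up the evaluation tree keeps every intermediate node set sorted and duplicate-free.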

The second issue is that the duplicate removal is not necessary at all in a wide variety of important cases. As long as the subexpression on the right does not access any parents (which likely covers >95% of real-world XPath expressions), i.e. the subresults originate from distinct subtrees, there can be no duplicates, and the subresults are automatically in document order. The merge then collapses into a simple concatenation. I admit that this case will sometimes be hard to detect, because the left part of the expression may already have matched overlapping parts of the tree, but I think it is worth more than just a try.
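The detection could look roughly like this (a hypothetical sketch, not libxml2 code; the axis names are the standard XPath ones, and the key assumption, flagged below, is that the left-hand matches root disjoint subtrees):

```python
# Axes that only move downward (or stay put) from the context node.
DOWNWARD_AXES = {"self", "child", "attribute",
                 "descendant", "descendant-or-self"}

def can_skip_dedup(step_axes):
    """Hypothetical check: if every axis in the right-hand
    subexpression moves downward, results from distinct subtrees
    cannot overlap, so no duplicates can arise.
    ASSUMPTION: the left-hand matches are roots of disjoint subtrees;
    if they overlap (e.g. nested dl elements), this shortcut is unsafe."""
    return all(axis in DOWNWARD_AXES for axis in step_axes)

def merge(result, subresult, step_axes):
    if can_skip_dedup(step_axes):
        result.extend(subresult)       # plain concatenation, O(len(subresult))
    else:
        for node in subresult:         # fall back to the duplicate scan
            if node not in result:
                result.append(node)
    return result
```

Since the subresults arrive in document order and come from disjoint subtrees, the concatenation also preserves document order, so no sorting pass is needed afterwards.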

Stefan
