[xml] XPath performance issues
- From: Stefan Behnel <stefan_ml behnel de>
- To: xml <xml gnome org>
- Subject: [xml] XPath performance issues
- Date: Fri, 04 Nov 2011 11:32:19 +0100
Hi,
almost exactly two years ago, I brought up the topic of the surprisingly
unpredictable XPath performance on this list (thread titled "confusing
xpath performance characteristics", without response at the time). The
problem is not the actual search, but the merging of node sets after the
fact. The effect is that a detailed expression like
/xs:schema/xs:complexType//xs:element[@name="equity"]/@type
is several orders of magnitude slower than the less constrained expression
//xs:element[@name="equity"]/@type
The problem here is that the evaluator finds several matches for
"xs:complexType", searches each subtree, and then goes into merging the
subresults and removing duplicates along the way, using a quadratic
algorithm. The runtime of this approach quickly explodes with the number of
nodes in the node set, especially since it can get applied several times
while going up the expression tree.
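To make the quadratic behaviour concrete, here is a minimal sketch (in Python rather than libxml2's C, with integer document-order indices standing in for node pointers): the naive merge scans the entire accumulated result for every incoming node, exactly the pattern of the "skip duplicates" loop.

```python
def merge_and_clear_naive(result, subresult):
    """Naive node-set merge in the style of xmlXPathNodeSetMergeAndClear:
    for every incoming node, linearly scan the whole result so far for a
    duplicate. O(len(result)) per node, hence quadratic overall."""
    for node in subresult:
        if node not in result:  # linear scan: the "skip duplicates" loop
            result.append(node)
    return result

# Two overlapping subtree results, as nested '//' steps can produce
# (the integers are stand-ins for document-order positions).
merged = merge_and_clear_naive([1, 2, 3, 4], [3, 4, 5, 6])
```

With thousands of nodes per subresult, and the merge repeated at each level of the expression tree, this scan-per-node cost is what dominates the runtime.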
There are several surprising expressions where this shows, e.g.
descendant-or-self::dl/descendant::dt
is very fast, whereas the semantically only slightly different
descendant-or-self::dl/descendant-or-self::*/dt
is already quite a bit slower, but the equivalent
descendant-or-self::dl//dt
is orders of magnitude slower. You can test them against the 4.7MB HTML5
spec page at
http://www.w3.org/TR/html5/Overview.html
The last approach takes literally hours, whereas the first two finish on
the order of seconds. I ran this under callgrind, and the top
function that takes 99.4% of the overall runtime is
xmlXPathNodeSetMergeAndClear(), specifically the inner loop starting with
the comment "skip duplicates".
There are two issues here. One is that the duplicate removal uses the
easiest-to-implement, and unluckily also the slowest, algorithm. This is
understandable, because doing better is not trivial at all in C. However,
the algorithm could be improved quite substantially, e.g. by using merge
sort (even based on an arbitrary sort criterion such as the node address).
If eventual sorting in document order is required, a merge sort is the best
thing to do anyway, as it could be applied recursively along the XPath
evaluation tree, thus always yielding sorted node sets at each stage.
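The merge-sort idea reduces the merge step itself to linear time. Assuming both subresults arrive already sorted (by document order, or by any total order such as the node address), a classic sorted merge can skip duplicates on the fly, a sketch:

```python
def merge_sorted_dedup(a, b):
    """Merge two sorted node sets in O(len(a) + len(b)), dropping
    duplicates as they are encountered. The output is sorted again,
    so the merge can be applied recursively up the expression tree."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i]); i += 1
        elif b[j] < a[i]:
            out.append(b[j]); j += 1
        else:  # duplicate: keep one copy, advance both inputs
            out.append(a[i]); i += 1; j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out
```

Compared to the scan-per-node approach, the cost of merging k subresults of total size n drops from O(n^2) to O(n log k) when the merges are arranged as a balanced tree.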
The second issue is that the duplicate removal is not necessary at all in a
wide variety of important cases. As long as the subexpression on the right
does not access any parents (which likely applies to >95% of real-world
XPath expressions), i.e. the subresults originate from distinct subtrees,
there can be no duplicates, and the subresults are automatically in
document order. Thus, the merging will collapse into a simple
concatenation. I admit that this case will sometimes be hard to detect
because the left part of the expression may already have matched
overlapping parts of the tree. But I think it is worth more than just a try.
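To illustrate the detection, here is a hypothetical sketch that models each context subtree as a (start, end) interval of document-order positions: when the left-hand context nodes are pairwise disjoint subtrees and the right-hand step stays inside them, the per-subtree results can simply be concatenated; otherwise it falls back to a dedup-and-sort merge.

```python
def subtrees_disjoint(intervals):
    """intervals: (start, end) document-order spans of the context
    nodes, sorted by start. Disjoint iff no span begins before the
    previous one has ended."""
    return all(prev_end < start
               for (_, prev_end), (start, _) in zip(intervals, intervals[1:]))

def combine(subresults, intervals):
    """If the context subtrees are disjoint, concatenation already
    preserves document order and cannot produce duplicates; otherwise
    fall back to an explicit dedup-and-sort merge."""
    if subtrees_disjoint(intervals):
        return [n for sub in subresults for n in sub]  # plain concat
    return sorted(set().union(*map(set, subresults)))
```

This is only a model of the idea, not libxml2 code; the hard part the paragraph above mentions is knowing, inside the evaluator, whether the left-hand matches really were non-overlapping.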
Stefan