[xml] XPath performance issues



Hi,

Almost exactly two years ago, I brought up the topic of surprisingly unpredictable XPath performance on this list (thread titled "confusing xpath performance characteristics"; it received no response at the time). The problem is not the actual search but the merging of node sets afterwards. The effect is that a detailed expression like

     /xs:schema/xs:complexType//xs:element[@name="equity"]/@type

is several orders of magnitude slower than the less constrained expression

    //xs:element[@name="equity"]/@type

The problem here is that the evaluator finds several matches for "xs:complexType", searches each subtree, and then merges the subresults, removing duplicates along the way with a quadratic algorithm. The runtime of this approach quickly explodes with the number of nodes in the node set, especially since the merge can be applied several times on the way up the expression tree.
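To make the cost concrete, here is a minimal Python sketch of the merge strategy described above (the names and the use of plain lists are my own; the real code in libxml2's xmlXPathNodeSetMergeAndClear() operates on C node-set arrays and compares node pointers):

```python
def merge_and_clear(result, subresult):
    """Model of the quadratic merge: for every node in the new
    subresult, linearly scan the accumulated result for a duplicate
    before appending ("skip duplicates"). That is O(len(result)) per
    node, i.e. O(n^2) overall as the result set grows."""
    for node in subresult:
        if node not in result:   # linear duplicate scan
            result.append(node)
    subresult.clear()            # the C code empties the source set too
    return result
```

With k subtree matches contributing n nodes in total, repeating this merge at each step of the expression tree is what makes the constrained expression so much slower than the unconstrained one.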

There are several surprising expressions where this shows, e.g.

    descendant-or-self::dl/descendant::dt

is very fast, whereas the semantically only slightly different

    descendant-or-self::dl/descendant-or-self::*/dt

is already quite a bit slower, but the equivalent

    descendant-or-self::dl//dt

is orders of magnitude slower. You can test them against the 4.7 MB HTML5 spec page at

http://www.w3.org/TR/html5/Overview.html

The last expression takes literally hours, whereas the first two finish within seconds. I ran this under callgrind; the top function, accounting for 99.4% of the overall runtime, is xmlXPathNodeSetMergeAndClear(), specifically the inner loop starting with the comment "skip duplicates".

There are two issues here. One is that the duplicate removal uses the easiest-to-implement, and unfortunately also slowest, algorithm. This is understandable, because doing better is not trivial in C. However, the algorithm could be improved quite substantially, e.g. by using merge sort (even on an arbitrary sort criterion such as the node address). If eventual sorting in document order is required, a merge sort is the best choice anyway, as it can be applied recursively along the XPath evaluation tree, thus yielding sorted node sets at every stage.
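A sketch of that merge-sort idea, again in Python for brevity (the function name and the use of a key function are my own; in C the key would be the document position or, as a fallback, the node address): merging two already-sorted node lists while dropping duplicates is linear rather than quadratic.

```python
def merge_sorted_unique(a, b, key=id):
    """Merge two node lists that are each already sorted by `key`
    (e.g. document order, or the node address as an arbitrary but
    consistent criterion), dropping duplicates on the way.
    Runs in O(len(a) + len(b)) instead of O(n^2), and the output is
    again sorted, so it can be reused at the next merge step."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        ka, kb = key(a[i]), key(b[j])
        if ka < kb:
            out.append(a[i]); i += 1
        elif kb < ka:
            out.append(b[j]); j += 1
        else:                      # same node in both sets: keep one copy
            out.append(a[i]); i += 1; j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out
```

Because the output is sorted by the same key, the merges compose: applying this recursively up the evaluation tree keeps every intermediate node set sorted and duplicate-free.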

The second issue is that the duplicate removal is not necessary at all in a wide variety of important cases. As long as the subexpression on the right does not access any parents (which likely covers >95% of real-world XPath expressions), i.e. the subresults originate from distinct subtrees, there can be no duplicates, and the subresults are automatically in document order. The merge then collapses into a simple concatenation. I admit that this case will sometimes be hard to detect, because the left part of the expression may already have matched overlapping parts of the tree, but I think it is worth more than just a try.
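The detection could look roughly like this (a hypothetical sketch, not libxml2 code; the axis names are the standard XPath ones, and the key assumption, flagged below, is that the left-hand matches root disjoint subtrees):

```python
# Axes that only move downward (or stay put) from the context node.
DOWNWARD_AXES = {"self", "child", "attribute",
                 "descendant", "descendant-or-self"}

def can_skip_dedup(step_axes):
    """Hypothetical check: if every axis in the right-hand
    subexpression moves downward, results from distinct subtrees
    cannot overlap, so no duplicates can arise.
    ASSUMPTION: the left-hand matches are roots of disjoint subtrees;
    if they overlap (e.g. nested dl elements), this shortcut is unsafe."""
    return all(axis in DOWNWARD_AXES for axis in step_axes)

def merge(result, subresult, step_axes):
    if can_skip_dedup(step_axes):
        result.extend(subresult)       # plain concatenation, O(len(subresult))
    else:
        for node in subresult:         # fall back to the duplicate scan
            if node not in result:
                result.append(node)
    return result
```

Since the subresults arrive in document order and come from disjoint subtrees, the concatenation also preserves document order, so no sorting pass is needed afterwards.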

Stefan
