Re: [xml] XPath performance issues

From: Daniel Veillard <veillard redhat com>
To: Stefan Behnel <stefan_ml behnel de>
Cc: xml <xml gnome org>
Subject: Re: [xml] XPath performance issues
Date: Thu, 10 Nov 2011 16:54:38 +0800

On Fri, Nov 04, 2011 at 11:32:19AM +0100, Stefan Behnel wrote:

Hi,

almost exactly two years ago, I brought up the topic of the
surprisingly unpredictable XPath performance on this list (thread
titled "confusing xpath performance characteristics", without
response at the time). The problem is not the actual search, but the
merging of node sets after the fact. The effect is that a detailed
expression like

     /xs:schema/xs:complexType//xs:element[ name="equity"]/@type

is several orders of magnitude slower than the less constrained expression

    //xs:element[ name="equity"]/@type

The problem here is that the evaluator finds several matches for
"xs:complexType", searches each subtree, and then goes into merging
the subresults and removing duplicates along the way, using a
quadratic algorithm. The runtime of this approach quickly explodes
with the number of nodes in the node set, especially since it can
get applied several times while going up the expression tree.


  yeah I agree that's far from optimal. What I tried to do is speed
up the comparison of nodes if one calls xmlXPathOrderDocElems() on the
document before XPath or XSLT processing on it.
  It's definitely a known issue, c.f. comment on line 12055 of xpath.c
there are trivial cases where libxml2 will use
  xmlXPathNodeSetMergeAndClearNoDupls()
when we know we can't reach duplicates, when walking axis child,
attributes or namespaces, but for anything else one need more context
to be able to make the optimization.

There are several surprising expressions where this shows, e.g.

    descendant-or-self::dl/descendant::dt

is very fast, whereas the semantically only slightly different

    descendant-or-self::dl/descendant-or-self::*/dt

is already quite a bit slower, but the equivalent

    descendant-or-self::dl//dt

is orders of magnitude slower. You can test them against the 4.7MB
HTML5 spec page at

http://www.w3.org/TR/html5/Overview.html

The last approach takes literally hours, whereas the first two
finish within the order of seconds. I ran this within callgrind, and
the top function that takes 99.4% of the overall runtime is
xmlXPathNodeSetMergeAndClear(), specifically the inner loop starting
with the comment "skip duplicates".

There are two issues here. One is that the duplicate removal uses
the easiest to implement, and unluckily also slowest algorithm. This
is understandable because doing better is not trivial at all in C.
However, the algorithm could be improved quite substantially, e.g.
by using merge sort (even based on an arbitrary sort criteria like
the node address). If eventual sorting in document order is
required, a merge sort is the best thing to do anyway, as it could
be applied recursively along the XPath evaluation tree, thus always
yielding sorted node sets at each stage.

The second issue is that the duplicate removal is not necessary at
all in a wide variety of important cases. As long as the
subexpression on the right does not access any parents (which likely
applies to >95% of the real world XPath expressions), i.e. the
subresults originate from distinct subtrees, there can be no
duplicates, and the subresults are automatically in document order.
Thus, the merging will collapse into a simple concatenation. I admit
that this case will sometimes be hard to detect because the left
part of the expression may already have matched overlapping parts of
the tree. But I think it is worth more than just a try.


  Completely agreed, but goal was to ensure correctness, and then
optimize, problem is that few optimizations were made.

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

References:
- [xml] XPath performance issues
  - From: Stefan Behnel

[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]