Re: [xml] [Patch] Optimizing '//' in XPath expressions



  Hi Nick,

On Sun, Aug 19, 2012 at 07:42:38PM +0200, Nick Wellnhofer wrote:
When investigating the libxslt performance problem reported in bug
#657665, I found that '//' in XPath expressions can be very slow
when working on large subtrees.

  I guess everybody agree on that statement :-)

One of the reasons is the seemingly quadratic time complexity of the
duplicate checks when merging result nodes. The other is a missed
optimization for expressions of the form
'descendant-or-self::node()/axis::test'. Since '//' is expanded to
'/descendant-or-self::node()/', this type of expression is quite
common. Depending on the axis of the expression following the
'descendant-or-self' step, the following replacements can be made:

from descendant-or-self::node()/child::test
to   descendant::test

from descendant-or-self::node()/descendant::test
to   descendant::test

from descendant-or-self::node()/self::test
to   descendant-or-self::test

from descendant-or-self::node()/descendant-or-self::test
to   descendant-or-self::test

'test' can be any kind of node test.

  That sounds right !
All those are forward progressing axis so the order in the node set
is also preserved.

With these replacements the possibly huge result of
'descendant-or-self::node()' doesn't have to be stored temporarily,
but can be processsed in one pass. If the resulting nodeset is
small, the duplicate checks aren't a problem.

  yes, as I said there is very little optimization in the
XPath engine and as you pointed out the way expressions are compiled
is far from optimal, i just followed the spec description

I found that there already is a function called
xmlXPathRewriteDOSExpression which performs this optimization for a
very limited set of cases. It employs a complicated iteration scheme
for rewritten expressions. AFAICS, this can be avoided by simply
changing the axis of the expression like described above.

  yes I think at the time I was a bit afraid of really modifying the
compiled tree but the 4 expressions above are clearly big enhancements.
I suspect it's just the top of the iceberg, there is a number of other
post-compilation optimization which can certainly be made, but with
less drastic improvements.

With the attached patch against libxml2 and the files from bug
#657665 I got the following results.

Before:

$ time xsltproc/xsltproc --noout service-names-port-numbers.xsl
service-names-port-numbers.xml
real    2m56.213s
user    2m56.123s
sys     0m0.080s

After:

$ time xsltproc/xsltproc --noout service-names-port-numbers.xsl
service-names-port-numbers.xml
real    0m3.836s
user    0m3.764s
sys     0m0.060s

I also ran the libxml2 and libxslt test suites with the patch and
couldn't detect any breakage.

  Impressive, a patch which removes more cruft than it adds and
provide an order of magnitude improvement! please send more patches :-)

    Gratefully reviewed, applied and pushed,

      thanks a lot !

Now I still need to review/apply the sort improvements...

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]