[xml] libxml2 XPath performances



[ fixing the post title my answer is a bit more generic ]

On Thu, Nov 11, 2004 at 07:43:07PM +0100, Luca Padovani wrote:
    Hi Luca,

  I had a very hard time getting asleep yesterday, and among the problems
haunting me were that XPath performance "problem", see below:

Recently I've been comparing different implementations for querying XML
documents and the attached table (sorry for the ASCII art) shows some of
the results for the indicated queries on a large (and deep) XML
document. The absolute times are in milliseconds and refer to 20
evaluations of the indicated paths, excluding parsing time. The factor f
is the ratio matching time/(parsing time + matching time) and is meant
to give a very rough approximation of how well the tool behaves
excluding the "context" (implementation language, ...) In the table, PET
refers to a C++ library for representing regular path expressions that I
have implemented and that allows you to specify walking paths using a
high-level syntax and the operators of regular expressions, and it makes
the g++ compiler automatically generate the walking code (in terms of
libxml2 fields and walking methods).

In the tests, I was surprised to see that a simple path like //node()
takes a very long time in libxml2 but is very fast in xalan (on the
other side, xalan seems to be much slower than libxml2 for the XSLT
transformation, but this is a different story).

   yes ...
In general libxml2 XPath is not optimized, for example consider

    paphio:~/XML -> ./testXPath --tree "//foo"
    Compiled Expression : 4 elements
      SORT
        COLLECT  'child' 'name' 'node' foo
          COLLECT  'descendant-or-self' 'type' 'node'
            ROOT
    Object is a Node Set :
    Set contains 0 nodes:
    paphio:~/XML ->

What does this mean ? First libxml2 compiles the expression to a tree
used for evaluation, and --tree dumps the tree. Check the COLLECT operation
they are filter passes done on nodeset results, they are run by
xmlXPathNodeCollectAndTest() (or xmlXPathNodeCollectAndTestNth()).

What does that mean ? That for //foo libxml2 first build the set of all
the nodes in the document, and then filter the temporary (huge) node set
for element with children nodes. Clearly the 2 operations should be
gathered in one, and xmlXPathNodeCollectAndTest() which operates on an
iterator function (like xmlXPathNextDescendantOrSelf()) and filters
(like the name of the node) could do both passes in one, it's just 
that the generated compiled tree is not optimal.

There is a number of relatively simple optimizations which could be done
and at various level:
   1/ the evaluation tree generation //foo is just one example. There is 
      already some dynamic evaluation reconfiguration of the tree at execution
      (look for xmlXPathCompSwap(op) if interested)
   2/ another huge gain possible is the use of the dictionnary, basically
      before interpreting the compiled expression against a document, and
      assuming the document use a dictionnary for all its nodes, rewriting
      the query to use strings from the document dictionnary, and run in an
      optimized mode where xmlStrcmp are replaced by pointer comparisons.
      if would fail if the document was modified via the API introducing names
      outside the dictionnary, so this should be kept as an option but would
      work just fine in general for libxslt.
   3/ the last simple and potentially huge gain is about ordering of node
      set at the "SORT" level of the evaluation. Basically based on properties
      of your current node set (one element, or already sorted in document
      order - or reverse -) some of the axis would keep some of these 
      properties. There is 2 optimizations possible there: 
         a/ if you know new node collected in the generated node set at the 
            "COLLECT" level will be unique then use xmlXPathNodeSetAddUnique
            and avoid a lot of processing xmlXPathNodeCollectAndTest() does
            taht already to some extent but this is far from complete
         b/ keeping that nodeset ordering information in the xpath evaluation
            context would allow the SORT level to be skipped like for
            example in //foo, at both COLLECT step you can infer the node
            set is already sorted in document order

One point where libxml2 could be certainly improved is on the comparison
of nodes for determining which node comes first in document order. This

  or like in 3/ in most case I suppose that comparison could simply be
skipped.

operation is important when merging node sets. If I understood the
implementation correctly, currently only element nodes can be compared
in constant time, whereas the other nodes often require expensive
traversal of the document tree (especially expensive if the document is
deep).

  yes but I think there is some basic optimizations when nodes are children
of the same elements.

The problem is that text nodes occur very often (as simple
indentation or just because they are needed) and still they cannot be
compared in constant time. I tried to hack a bit the comparison function
exploiting the fact that text nodes occur very often as children of
element nodes (which can be compared quickly) and I could see some
significant improvement. I haven't submitted the patch though because it
looked dirty, and my personal thinking is that this issue should be
fixed in a more general and clean way and not exploiting some unused
fields in the xmlNode structure. As we know, real-world applications are

  I don't see why the xmlNode structure should be used. Actually it is
within libxslt, it uses xmlXPathOrderDocElems() which generates document
nodes order used in speeding up the process. If you run XPath queries
directly you will need the full scan, if you call xmlXPathOrderDocElems()
first, the comparison stops as soon as you compare the elements or the
parents elements.

Anyway, I don't want to cast a bad light on libxml2, which is by far my
most favorite implementation for parsing and transformation and XML. If
performances are really critical for you, then I would recommend you to
write the walking code manually (or generate it as it is done in PET).
If not, I'd prefer XPath for it's already there and requires very little
effort from the programmer.

  As I pointed in this mail there is a lot to be done which can dramatically
improve the XPath performances and probably in minimal coding. Each of
them could be a fun week-end project for someone who looked at xpath.c
already. There is even more to be gained I think using dictionnaries at 
the libxslt level, for example using the input document dictionnary to
automatically discard template rules requiring names that are not in the
document dictionnary. That could be done at the XPath level in when generating
the docuemnt optimized query, for example:

paphio:~/XML -> ./testXPath --tree "//foo[ bar='1']"
Compiled Expression : 10 elements
  SORT
    COLLECT  'child' 'name' 'node' foo
      COLLECT  'descendant-or-self' 'type' 'node'
        ROOT
      PREDICATE
        SORT
          EQUAL =
            COLLECT  'attributes' 'name' 'node' bar
              NODE
            ELEM Object is a string : 1
              COLLECT  'attributes' 'name' 'node' bar
                NODE
Object is a Node Set :
Set contains 0 nodes:
paphio:~/XML ->

  if the document dictionnary doesn't have 'bar' then you can do static
optimization, this for example you know COLLECT  'attributes' 'name' 'node' bar
will return an empty node set, so EQUAL to a string '1' is garanteed to 
fail, so the predicate will fail too and you can optimize the whole
query as returning EMPTY ...

  yes there is a lot of fun stuff which can be done to optimize libxml2 XPath
some are pure bastard code optimizations, and others are more high level
academic like query optimizations.

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]