Re: [xml] XPath versus node walking performance.



Hi Jose,

On Wed, 2004-11-10 at 19:09 +0000, Jose Commins wrote:
      I've been looking into using XPath rather than walking nodes - does 
anyone have any experience in the performance differences between these 
two methods?

From my experience, it depends on the complexity of the paths you want
to walk. XPath is nice because it allows you to abstract from the
low-level walking operations, but its implementation in libxml2 is not
always satisfactory, especially if the evaluation of the expression
requires several node set merge operations. Unfortunately this
characteristic is not always easy to infer by looking at the XPath
expression alone.

Recently I've been comparing different implementations for querying XML
documents and the attached table (sorry for the ASCII art) shows some of
the results for the indicated queries on a large (and deep) XML
document. The absolute times are in milliseconds and refer to 20
evaluations of the indicated paths, excluding parsing time. The factor f
is the ratio matching time/(parsing time + matching time) and is meant
to give a very rough approximation of how well the tool behaves
excluding the "context" (implementation language, ...) In the table, PET
refers to a C++ library for representing regular path expressions that I
have implemented and that allows you to specify walking paths using a
high-level syntax and the operators of regular expressions, and it makes
the g++ compiler automatically generate the walking code (in terms of
libxml2 fields and walking methods).

In the tests, I was surprised to see that a simple path like //node()
takes a very long time in libxml2 but is very fast in xalan (on the
other side, xalan seems to be much slower than libxml2 for the XSLT
transformation, but this is a different story).

One point where libxml2 could be certainly improved is on the comparison
of nodes for determining which node comes first in document order. This
operation is important when merging node sets. If I understood the
implementation correctly, currently only element nodes can be compared
in constant time, whereas the other nodes often require expensive
traversal of the document tree (especially expensive if the document is
deep). The problem is that text nodes occur very often (as simple
indentation or just because they are needed) and still they cannot be
compared in constant time. I tried to hack a bit the comparison function
exploiting the fact that text nodes occur very often as children of
element nodes (which can be compared quickly) and I could see some
significant improvement. I haven't submitted the patch though because it
looked dirty, and my personal thinking is that this issue should be
fixed in a more general and clean way and not exploiting some unused
fields in the xmlNode structure. As we know, real-world applications are
always made of compromises, it's a fact that we have to accept.

Anyway, I don't want to cast a bad light on libxml2, which is by far my
most favorite implementation for parsing and transformation and XML. If
performances are really critical for you, then I would recommend you to
write the walking code manually (or generate it as it is done in PET).
If not, I'd prefer XPath for it's already there and requires very little
effort from the programmer.

Hope that helps,
--luca

Attachment: queries.txt
Description: Text document



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]