Hi Jose, On Wed, 2004-11-10 at 19:09 +0000, Jose Commins wrote:
I've been looking into using XPath rather than walking nodes - does anyone have any experience in the performance differences between these two methods?
From my experience, it depends on the complexity of the paths you want
to walk. XPath is nice because it allows you to abstract from the low-level walking operations, but its implementation in libxml2 is not always satisfactory, especially if the evaluation of the expression requires several node set merge operations. Unfortunately this characteristic is not always easy to infer by looking at the XPath expression alone. Recently I've been comparing different implementations for querying XML documents and the attached table (sorry for the ASCII art) shows some of the results for the indicated queries on a large (and deep) XML document. The absolute times are in milliseconds and refer to 20 evaluations of the indicated paths, excluding parsing time. The factor f is the ratio matching time/(parsing time + matching time) and is meant to give a very rough approximation of how well the tool behaves excluding the "context" (implementation language, ...) In the table, PET refers to a C++ library for representing regular path expressions that I have implemented and that allows you to specify walking paths using a high-level syntax and the operators of regular expressions, and it makes the g++ compiler automatically generate the walking code (in terms of libxml2 fields and walking methods). In the tests, I was surprised to see that a simple path like //node() takes a very long time in libxml2 but is very fast in xalan (on the other side, xalan seems to be much slower than libxml2 for the XSLT transformation, but this is a different story). One point where libxml2 could be certainly improved is on the comparison of nodes for determining which node comes first in document order. This operation is important when merging node sets. If I understood the implementation correctly, currently only element nodes can be compared in constant time, whereas the other nodes often require expensive traversal of the document tree (especially expensive if the document is deep). The problem is that text nodes occur very often (as simple indentation or just because they are needed) and still they cannot be compared in constant time. I tried to hack a bit the comparison function exploiting the fact that text nodes occur very often as children of element nodes (which can be compared quickly) and I could see some significant improvement. I haven't submitted the patch though because it looked dirty, and my personal thinking is that this issue should be fixed in a more general and clean way and not exploiting some unused fields in the xmlNode structure. As we know, real-world applications are always made of compromises, it's a fact that we have to accept. Anyway, I don't want to cast a bad light on libxml2, which is by far my most favorite implementation for parsing and transformation and XML. If performances are really critical for you, then I would recommend you to write the walking code manually (or generate it as it is done in PET). If not, I'd prefer XPath for it's already there and requires very little effort from the programmer. Hope that helps, --luca
Attachment:
queries.txt
Description: Text document