Re: [xml] libxml2 XPath performances



Hi All,

let me contribute a bit to your discussion on this topic.

Daniel Veillard <veillard redhat com> writes:
On Thu, Nov 11, 2004 at 07:43:07PM +0100, Luca Padovani wrote:
<snip>

The problem is that text nodes occur very often (as simple
indentation or just because they are needed) and still they cannot be
compared in constant time. I tried to hack a bit the comparison function
exploiting the fact that text nodes occur very often as children of
element nodes (which can be compared quickly) and I could see some
significant improvement. I haven't submitted the patch though because it
looked dirty, and my personal thinking is that this issue should be
fixed in a more general and clean way and not exploiting some unused
fields in the xmlNode structure. As we know, real-world applications are

  I don't see why the xmlNode structure should be used. Actually it is
within libxslt, it uses xmlXPathOrderDocElems() which generates document
nodes order used in speeding up the process. If you run XPath queries
directly you will need the full scan, if you call xmlXPathOrderDocElems()
first, the comparison stops as soon as you compare the elements or the
parents elements.

1) Assuming xmlXPathOrderDocElems has been called, if the compared
nodes have different parents, then the nearest common ancestor is
looked for (which is O(depth of the tree)), whereas it would IMO
suffice to compare the parents (node->parent is always an element,
right?) using indexed doc-order (getting parents is O(1)), right?

2) Further more, some speedup with xmlXPathOrderDocElems could be done
even in the case when the compared nodes are not elements (e.g. text
nodes) but have the same parent. In that case, currently libxml2 does

    for (cur = node1->next;cur != NULL;cur = cur->next)
        if (cur == node2)
            return(1);
    return(-1); /* assume there is no sibling list corruption */

Utilizing the fact that non-indexed elements (by index I mean the
number assigned to them by xmlXPathOrderDocElems) are usually
surrounded by indexed elements, the above could be improved as
follows: as soon as we find a cur with an index, we could iterate from
node2 backward (->prev) and either we find cur, or we find another
node with an index. In the latter case we simply compare the indexes,
i.e. (not tested)

#define HAS_INDEX(node) ((node->type == XML_ELEMENT_NODE) && (0 > (long) node->content))
#define INDEX(node) ( -((long) node->content) )

    cur = node1->next;
    while (cur != NULL) {
      if (cur == node2) return(1);
      if (HAS_INDEX(cur)) {
         xmlNodePtr cur2;
         cur2 = node2->prev;
         while (cur2 != NULL) {
           if (cur2 == cur) return(1);
           if (HAS_INDEX(cur2)) {
             if (INDEX(cur1) < INDEX(cur2)) return(1);
             if (INDEX(cur1) > INDEX(cur2)) return(-1);
             return 0;
           }
           cur = cur->prev;
         }
         return(-1); /* assume there is no sibling list corruption */
      }
      cur = cur->next;

    }
    return(-1); /* assume there is no sibling list corruption */

This would make comparing sibling text nodes O(1) in most cases since
HAS_INDEX(node1->next) && HAS_INDEX(node2->prev) would hold in most
cases (since as I already mentioned, except for a few cases like
entities, processing instructions, and XInclude, text nodes are usually
surrounded by elements).

Thus, in an indexed document, the only remaining non-constant time
comparison would be the cases like: text &ent; text &ent; PI text ..

Opinions?

-- Petr

Attachment: pgpPO9z_Vhwed.pgp
Description: PGP signature



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]