Re: [xml] libxml2 XPath performances

From: Daniel Veillard <veillard redhat com>
To: Petr Pajas <pajas ufal ms mff cuni cz>
Cc: Luca Padovani <lpadovan cs unibo it>, xml gnome org, Jose Commins <axora myrealbox com>
Subject: Re: [xml] libxml2 XPath performances
Date: Fri, 12 Nov 2004 06:51:25 -0500

On Fri, Nov 12, 2004 at 12:40:22PM +0100, Petr Pajas wrote:

Hi All,

let me contribute a bit to your discussion on this topic.

Daniel Veillard <veillard redhat com> writes:

On Thu, Nov 11, 2004 at 07:43:07PM +0100, Luca Padovani wrote:

<snip>

The problem is that text nodes occur very often (as simple
indentation or just because they are needed) and still they cannot be
compared in constant time. I tried to hack a bit the comparison function
exploiting the fact that text nodes occur very often as children of
element nodes (which can be compared quickly) and I could see some
significant improvement. I haven't submitted the patch though because it
looked dirty, and my personal thinking is that this issue should be
fixed in a more general and clean way and not exploiting some unused
fields in the xmlNode structure. As we know, real-world applications are

  I don't see why the xmlNode structure should be used. Actually it is
within libxslt, it uses xmlXPathOrderDocElems() which generates document
nodes order used in speeding up the process. If you run XPath queries
directly you will need the full scan, if you call xmlXPathOrderDocElems()
first, the comparison stops as soon as you compare the elements or the
parents elements.

1) Assuming xmlXPathOrderDocElems has been called, if the compared
nodes have different parents, then the nearest common ancestor is
looked for (which is O(depth of the tree)), whereas it would IMO
suffice to compare the parents (node->parent is always an element,
right?) using indexed doc-order (getting parents is O(1)), right?

  No, if xmlXPathOrderDocElems() has been called I would expect the 
comparison to be O(1) for nodes on different elements, a list walk
is still needed if for examples the nodes are siblings on a same element.
Maybe libxml2 does not do the optimum there.

2) Further more, some speedup with xmlXPathOrderDocElems could be done
even in the case when the compared nodes are not elements (e.g. text
nodes) but have the same parent. In that case, currently libxml2 does

    for (cur = node1->next;cur != NULL;cur = cur->next)
      if (cur == node2)
          return(1);
    return(-1); /* assume there is no sibling list corruption */

Utilizing the fact that non-indexed elements (by index I mean the
number assigned to them by xmlXPathOrderDocElems) are usually
surrounded by indexed elements, the above could be improved as
follows: as soon as we find a cur with an index, we could iterate from
node2 backward (->prev) and either we find cur, or we find another
node with an index. In the latter case we simply compare the indexes,
i.e. (not tested)

#define HAS_INDEX(node) ((node->type == XML_ELEMENT_NODE) && (0 > (long) node->content))
#define INDEX(node) ( -((long) node->content) )

    cur = node1->next;
    while (cur != NULL) {
      if (cur == node2) return(1);
      if (HAS_INDEX(cur)) {
         xmlNodePtr cur2;
         cur2 = node2->prev;
         while (cur2 != NULL) {
           if (cur2 == cur) return(1);
           if (HAS_INDEX(cur2)) {
           if (INDEX(cur1) < INDEX(cur2)) return(1);
           if (INDEX(cur1) > INDEX(cur2)) return(-1);
             return 0;
           }
           cur = cur->prev;
         }
         return(-1); /* assume there is no sibling list corruption */
      }
      cur = cur->next;

    }
    return(-1); /* assume there is no sibling list corruption */

This would make comparing sibling text nodes O(1) in most cases since
HAS_INDEX(node1->next) && HAS_INDEX(node2->prev) would hold in most
cases (since as I already mentioned, except for a few cases like
entities, processing instructions, and XInclude, text nodes are usually
surrounded by elements).

  Interesting, yes, I would suggest trying that in the code and running the
libxml2 and libxslt regression tests to see if this actually works :-)

Thus, in an indexed document, the only remaining non-constant time
comparison would be the cases like: text &ent; text &ent; PI text ..

Opinions?

  Try it and send a patch :-)

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

Follow-Ups:
- Re: [xml] libxml2 XPath performances
  - From: Petr Pajas

References:
- [xml] XPath versus node walking performance.
  - From: Jose Commins
- Re: [xml] XPath versus node walking performance.
  - From: Luca Padovani
- [xml] libxml2 XPath performances
  - From: Daniel Veillard
- Re: [xml] libxml2 XPath performances
  - From: Petr Pajas

[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]