Hi All, let me contribute a bit to your discussion on this topic. Daniel Veillard <veillard redhat com> writes:
On Thu, Nov 11, 2004 at 07:43:07PM +0100, Luca Padovani wrote:
<snip>
The problem is that text nodes occur very often (as simple indentation or just because they are needed) and still they cannot be compared in constant time. I tried to hack a bit the comparison function exploiting the fact that text nodes occur very often as children of element nodes (which can be compared quickly) and I could see some significant improvement. I haven't submitted the patch though because it looked dirty, and my personal thinking is that this issue should be fixed in a more general and clean way and not exploiting some unused fields in the xmlNode structure. As we know, real-world applications areI don't see why the xmlNode structure should be used. Actually it is within libxslt, it uses xmlXPathOrderDocElems() which generates document nodes order used in speeding up the process. If you run XPath queries directly you will need the full scan, if you call xmlXPathOrderDocElems() first, the comparison stops as soon as you compare the elements or the parents elements.
1) Assuming xmlXPathOrderDocElems has been called, if the compared nodes have different parents, then the nearest common ancestor is looked for (which is O(depth of the tree)), whereas it would IMO suffice to compare the parents (node->parent is always an element, right?) using indexed doc-order (getting parents is O(1)), right? 2) Further more, some speedup with xmlXPathOrderDocElems could be done even in the case when the compared nodes are not elements (e.g. text nodes) but have the same parent. In that case, currently libxml2 does for (cur = node1->next;cur != NULL;cur = cur->next) if (cur == node2) return(1); return(-1); /* assume there is no sibling list corruption */ Utilizing the fact that non-indexed elements (by index I mean the number assigned to them by xmlXPathOrderDocElems) are usually surrounded by indexed elements, the above could be improved as follows: as soon as we find a cur with an index, we could iterate from node2 backward (->prev) and either we find cur, or we find another node with an index. In the latter case we simply compare the indexes, i.e. (not tested) #define HAS_INDEX(node) ((node->type == XML_ELEMENT_NODE) && (0 > (long) node->content)) #define INDEX(node) ( -((long) node->content) ) cur = node1->next; while (cur != NULL) { if (cur == node2) return(1); if (HAS_INDEX(cur)) { xmlNodePtr cur2; cur2 = node2->prev; while (cur2 != NULL) { if (cur2 == cur) return(1); if (HAS_INDEX(cur2)) { if (INDEX(cur1) < INDEX(cur2)) return(1); if (INDEX(cur1) > INDEX(cur2)) return(-1); return 0; } cur = cur->prev; } return(-1); /* assume there is no sibling list corruption */ } cur = cur->next; } return(-1); /* assume there is no sibling list corruption */ This would make comparing sibling text nodes O(1) in most cases since HAS_INDEX(node1->next) && HAS_INDEX(node2->prev) would hold in most cases (since as I already mentioned, except for a few cases like entities, processing instructions, and XInclude, text nodes are usually surrounded by elements). Thus, in an indexed document, the only remaining non-constant time comparison would be the cases like: text &ent; text &ent; PI text .. Opinions? -- Petr
Attachment:
pgpPO9z_Vhwed.pgp
Description: PGP signature