I have modified libxml.py (the handmade part of libxml2.py) to make
use of some new features of Python 2.2, namely:
1) replaced __getattr__() with property()
2) added an __iter__() method that returns an iterator over the subtree
of a node
The result of replacing __getattr__() with property() is a 2x speedup
when iterating over a tree using node.next and node.children.
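To illustrate the kind of change meant here, the following is a minimal sketch, not the actual libxml2 wrapper code. The classes GetattrNode and PropertyNode and the _next field are hypothetical stand-ins; only the get_next-style accessor and the `next` attribute name mirror what the post describes.

```python
# Hypothetical stand-ins for illustration; not the real libxml2 classes.

class GetattrNode(object):
    """Old style: attribute access falls through to __getattr__."""

    def __init__(self, next_node=None):
        self._next = next_node

    def get_next(self):
        return self._next

    def __getattr__(self, name):
        # Reached only after normal lookup has failed; the name is then
        # matched by string comparison on every single access.
        if name == "next":
            return self.get_next()
        raise AttributeError(name)


class PropertyNode(object):
    """Python 2.2 style: the accessor is bound once, in the class body."""

    def __init__(self, next_node=None):
        self._next = next_node

    def get_next(self):
        return self._next

    # property(fget, fset, fdel, doc): read-only attribute backed by get_next
    next = property(get_next, None, None, "next sibling")


first = PropertyNode()
second = PropertyNode(first)
# Both styles expose the same node.next interface to callers.
same_interface = second.next is first
```

With property() the lookup is resolved through the class directly, so the failed instance lookup and the string comparisons inside __getattr__ are avoided. That is presumably where the measured speedup comes from, although the sketch itself does not demonstrate the timing.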
An additional 12% speedup can be gained by not having a separate xmlCore
class (it is used only by xmlNode) but including the hand-made part
directly in definition of class xmlNode.
I tested it by running the following test function over the roughly
75000 nodes of the Old Testament:
import libxml2, time

ot = libxml2.parseFile('XML-SAMPLES/religion/ot/ot.xml')

def nodecount(doc):
    n = 0
    t1 = time.time()
    # relies on the new __iter__() method to walk the whole tree
    for node in doc:
        n += 1
    t2 = time.time()
    print 'nodes: %s' % n
    # parentheses are required: % binds tighter than -
    print 'time : %.2f' % (t2 - t1)
    print 'nodes/sec %.2f' % (n / (t2 - t1))
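The `for node in doc` loop relies on the added __iter__() method. As a rough sketch of how such a subtree iterator might look: the Node class and add_child helper below are hypothetical, and the choices to yield the starting node itself and to walk in document order are assumptions, not necessarily what the attached code does. On Python 2.2 itself a generator needs `from __future__ import generators` at the top of the module.

```python
class Node(object):
    """Hypothetical minimal node using libxml2-style links:
    `children` is the first child, `next` is the next sibling."""

    def __init__(self, name):
        self.name = name
        self.children = None
        self.next = None

    def add_child(self, child):
        # Append child at the end of the sibling chain.
        if self.children is None:
            self.children = child
        else:
            last = self.children
            while last.next is not None:
                last = last.next
            last.next = child
        return child

    def __iter__(self):
        # Depth-first walk in document order, starting with this node.
        yield self
        child = self.children
        while child is not None:
            for node in child:  # recurse into the child's subtree
                yield node
            child = child.next


root = Node("root")
a = root.add_child(Node("a"))
a.add_child(Node("a1"))
root.add_child(Node("b"))
names = [node.name for node in root]
```

The walk only follows node.children and node.next, which is why the cost of those two attribute lookups dominates the timings.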
With the current libxml plus the added __iter__() it produces:
nodecount(ot)
nodes: 74957
time : 27.16
nodes/sec 2760.39

With my modifications (including removing xmlCore) the result is:

nodecount(ot)
nodes: 74957
time : 12.87
nodes/sec 5824.71

I have some questions about the code though:
1) Why does the current code have duplicate defs of doc(self)?
2) What is the alternate name getContent for get_content used for? Do
some tests depend on it?
3) Should nodeWrap(o) return an xmlAttr for type "dtd"? I changed my
code to return an xmlDtd; is that the right thing to do?

I guess an order-of-magnitude speedup could be gained by moving the
whole of libxml2.py to a C module. Are there any plans to do so?

--
Hannu Krosing <hannu tm ee>
Attachment:
libxml_py22.py
Description: Text Data