I have modified libxml.py (the handmade part of libxml2.py) to make
use of some new features of Python 2.2, namely:

1) replaced __getattr__() with property()

2) added an __iter__() method that returns an iterator over the
   subtree of a node

The result of replacing __getattr__() with property() is a 2x speedup
when iterating over a tree using node.next and node.children.

An additional 12% speedup can be gained by not having a separate
xmlCore class (it is used only by xmlNode) but including the
hand-made part directly in the definition of class xmlNode.

I tested it by running the following test function over the 75000
nodes of the Old Testament sample:

    import libxml2, time

    ot = libxml2.parseFile('XML-SAMPLES/religion/ot/ot.xml')

    def nodecount(doc):
        n = 0
        t1 = time.time()
        # iterating over doc relies on the added __iter__() method
        for node in doc:
            n += 1
        t2 = time.time()
        print 'nodes: %s' % n
        # parentheses are needed: % binds tighter than -
        print 'time : %.2f' % (t2-t1)
        print 'nodes/sec %.2f' % (n / (t2-t1))

With the current libxml plus the added __iter__() it produces:
    >>> nodecount(ot)
    nodes: 74957
    time : 27.16
    nodes/sec 2760.39

With my modifications (including removing xmlCore) the result is:
    >>> nodecount(ot)
    nodes: 74957
    time : 12.87
    nodes/sec 5824.71

I have some questions about the code, though:

1) Why does the current code have duplicate defs of doc(self)?

2) What is the alternate name getContent for get_content used for?
   Do some tests depend on it?

3) Should nodeWrap(o) return an xmlAttr for type "dtd"?
   I changed my code to return an xmlDtd. Is that the right thing to do?

I guess an order-of-magnitude speedup could be gained by moving the
whole libxml2.py to a C module. Are there any plans of doing so?

-- 
Hannu Krosing <hannu tm ee>
Attachment:
libxml_py22.py
Description: Text Data
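
As a rough illustration of the two changes described in the message
(property() in place of __getattr__(), and an __iter__() that walks a
node's subtree), here is a minimal sketch. It assumes a hypothetical
Node class with first-child / next-sibling links; Node, add_child,
build_sample and the simplified nodecount are made up for this example
and are not part of the libxml2 bindings or of the attached file. It is
written for current Python rather than 2.2.

```python
class Node(object):
    """Hypothetical stand-in for a wrapped tree node (not libxml2's xmlNode)."""

    def __init__(self, name):
        self.name = name
        self._children = None  # first child, or None
        self._last = None      # last child, kept only to make appending cheap
        self._next = None      # next sibling, or None

    def add_child(self, child):
        """Append child after the current last child and return it."""
        if self._children is None:
            self._children = child
        else:
            self._last._next = child
        self._last = child
        return child

    def get_children(self):
        return self._children

    def get_next(self):
        return self._next

    # property() binds each attribute name to its accessor directly, so
    # node.children and node.next do not go through a catch-all
    # __getattr__() that has to compare the requested name on every access.
    children = property(get_children, None, None, "first child node")
    next = property(get_next, None, None, "next sibling node")

    def __iter__(self):
        """Yield this node, then every node below it, in document order."""
        yield self
        child = self.children
        while child is not None:
            # Recurse into the child's own subtree via its __iter__().
            for node in child:
                yield node
            child = child.next


def build_sample():
    """Build a five-node tree: root -> (a -> (a1, a2), b)."""
    root = Node("root")
    a = root.add_child(Node("a"))
    root.add_child(Node("b"))
    a.add_child(Node("a1"))
    a.add_child(Node("a2"))
    return root


def nodecount(root):
    """Count the nodes reached by iterating over root (timing omitted)."""
    n = 0
    for node in root:
        n += 1
    return n
```

With this sketch, nodecount(build_sample()) walks the whole sample tree
with a plain for loop, in the same way the test function in the message
iterates over a parsed document.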