[xml] faster replacement libxml.py for python 2.2




I have modified libxml.py (the handmade part of libxml2.py) to make
use of some new features of Python 2.2, namely:

1) replaced __getattr__() with property()

2) added an __iter__() method that returns an iterator over the subtree
of a node
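
To illustrate point 2, here is a minimal sketch of a subtree iterator. It
is not the code from the attachment: the Node class and its append()
helper are hypothetical stand-ins, and only the attribute names children
(first child), next (next sibling) and parent follow the libxml2 wrapper.
On Python 2.2 itself a generator also needs
"from __future__ import generators" at the top of the module.

```python
class Node(object):
    """Hypothetical stand-in for a wrapped tree node (not the real xmlNode)."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = None   # first child, as in the libxml2 wrapper
        self.next = None       # next sibling
        self._last = None      # helper for append(), illustration only

    def append(self, child):
        # Link child in as the last child of this node.
        child.parent = self
        if self.children is None:
            self.children = child
        else:
            self._last.next = child
        self._last = child
        return child

    def __iter__(self):
        # Walk the subtree rooted at self in document order, without recursion.
        node = self
        while node is not None:
            yield node
            if node.children is not None:
                node = node.children
                continue
            # No children: climb until a next sibling exists,
            # but never leave the subtree rooted at self.
            while node is not self and node.next is None:
                node = node.parent
            if node is self:
                return
            node = node.next


root = Node("root")
a = root.append(Node("a"))
a.append(Node("b"))
a.append(Node("c"))
root.append(Node("d"))
names = [n.name for n in root]
```

With such a method "for node in doc:" visits every node once, which is
what the timing function further down relies on.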

The result of replacing __getattr__() with property() is a 2x speedup
when iterating over a tree using node.next and node.children.
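
A rough sketch of the difference, assuming the old code dispatches on the
attribute name inside __getattr__(). Both classes below are hypothetical
and much smaller than the real wrapper; they only show the two lookup
styles.

```python
class GetattrNode:
    """Old style: the attribute name is compared at every access."""

    def __init__(self, nxt=None):
        self._next = nxt

    def get_next(self):
        return self._next

    def __getattr__(self, attr):
        # Reached only after normal lookup has failed, then dispatches by name.
        if attr == "next":
            return self.get_next()
        raise AttributeError(attr)


class PropertyNode(object):
    """New style (Python 2.2+): the accessor is bound to the name in the class."""

    def __init__(self, nxt=None):
        self._next = nxt

    def get_next(self):
        return self._next

    next = property(get_next, None, None, "next sibling")


def chain_length(node):
    # Follow .next until the end of the sibling chain.
    n = 0
    while node is not None:
        n += 1
        node = node.next
    return n


old_chain = GetattrNode(GetattrNode(GetattrNode()))
new_chain = PropertyNode(PropertyNode(PropertyNode()))
```

Both chains behave the same from the caller's side; the property version
skips the failed normal lookup and the string comparisons on each access.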

An additional 12% speedup can be gained by not having a separate xmlCore
class (it is used only by xmlNode) but including the hand-made part
directly in the definition of class xmlNode.

I tested it by running the following test function over the roughly
75000 nodes of the Old Testament:

import libxml2, time
ot = libxml2.parseFile('XML-SAMPLES/religion/ot/ot.xml')

def nodecount(doc):
    n=0
    t1 = time.time()
    for node in doc:
        n += 1
    t2 = time.time()
    print 'nodes: %s' % n
    print 'time : %.2f' % (t2-t1)
    print 'nodes/sec %.2f' % (n /(t2-t1))


With the current libxml.py plus an added __iter__() it produces:

nodecount(ot)
nodes: 74957
time : 27.16
nodes/sec 2760.39


With my modifications (including removing xmlCore) the result is:

nodecount(ot)
nodes: 74957
time : 12.87
nodes/sec 5824.71


I have some questions about the code though:

1) Why does the current code have duplicate defs of doc(self)?

2) What is the alternate name getContent for get_content used for?
   Do some tests depend on it?

3) Should nodeWrap(o) return an xmlAttr for type "dtd"?
   I changed my code to return an xmlDtd. Is that the right thing to do?


I guess an order-of-magnitude speedup could be gained by moving the
whole libxml2.py to a C module. Are there any plans to do so?

-- 
Hannu Krosing <hannu tm ee>

Attachment: libxml_py22.py
Description: Text Data


