[xml] libxml support for Python



I've been thinking about ways that libxml (and libxslt) can be used
to provide XML support for the Python programming language.

What follows are notes that I've written on this.

I'll appreciate any comments, suggestions, guidance, etc.

===========================================================



                     libxml Support for Python
                                      
   I'm interested in providing support for Python built on libxml
   and libxslt.
   
   Since this is the libxml list, I'll concentrate on libxml, although
   it is important to be able to use libxslt from Python, too.
   
   We should consider providing support in the following areas:
   
     * Support for the DOM interface built on libxml (or gdome?).
     * Support for the SAX interface built on libxml.
     * Support for XSLT built on libxslt. (Possibly a discussion for the
       libxslt list.)
       
  Support for DOM
  
   I've built DOM support for Python by hand, i.e. manually written
   wrapper functions, types, etc. that expose libxml's DOM support.
   (Available at http://www.rexx.com/~dkuhlman.) But it's weak.
   
   I've also used SWIG to generate wrappers for the libxml DOM support.
   Basically, I generated wrappers for the stuff in include/libxml/tree.h
   and include/libxml/parser.h. It works pretty well. A bit of
   evaluation:
   
     * Because I used SWIG's shadow classes, the doc, nodes, and
       attributes look, from Python's point of view, like instances of
       classes. So, walking the DOM tree is very easy and natural.
     * One benefit of doing this -- The Python objects (xmlDoc, xmlNode,
       xmlAttr) are proxies for the "real", underlying (libxml) C objects
       and the linkages between objects are in the underlying C objects.
       Therefore, this implementation does not suffer from the problems
       caused by circular references in Python objects. (Note that I
       believe that I also solved this problem in my hand-written
       wrappers.)
     * Moreover, the Python objects are created and destroyed on the fly
       and only on request. For example,
            node = node.children
            node = node.next
       This code creates two Python node objects. Furthermore, when the
       value of variable 'node' is overwritten (and if there is no other
       reference to that value), the Python object is destroyed. (For
       non-Python people who are still reading, Python uses a reference
       counting strategy for managing memory.) The upshot is that this
       implementation (and my hand-written one as well) enables Python
       scripts to load and use large DOM trees with very little memory
       overhead above that used by the libxml C objects.
     * One qualification is that the interface is at the level of
       libxml, so it's a bit low level. For example, a long running
       application would have to call a 'free' method, e.g. xmlFreeDoc,
       which is not something a Python programmer would expect to have to
       do.
     * Another qualification is that this implementation needs some
       fix-up, because there are some kinds of nodes in the tree that
       can cause segmentation faults.
     * And, the generated code is a bit large. I'm not sure that this is
       a concern in a world where disk space is so cheap. It's possible
       that we will want to trim and not generate code for a few things.
       On the other hand, there may be additional libxml DOM related
       capabilities that we would also want to expose (and which would
       make it even larger): Catalogs (catalog.h)? Entities (entities.h)?
       Encodings (encoding.h)?
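
   To illustrate the proxy idea from the evaluation above, here is a
   minimal sketch in pure Python. It does not use libxml or the SWIG
   wrappers: CNode is a hypothetical stand-in for the underlying C node,
   and NodeProxy and its 'live' counter are names invented for this
   example. It only shows that proxies are created on request and
   released as soon as the variable is rebound (on CPython, which uses
   reference counting).

```python
class CNode:
    """Hypothetical stand-in for a libxml C node (the real one is a C struct)."""

    def __init__(self, name):
        self.name = name
        self.children = None  # first child
        self.next = None      # next sibling


class NodeProxy:
    """Thin proxy created on request; the tree links live in CNode only."""

    live = 0  # number of proxies currently alive, for illustration only

    def __init__(self, cnode):
        self._cnode = cnode
        NodeProxy.live += 1

    def __del__(self):
        NodeProxy.live -= 1

    @property
    def name(self):
        return self._cnode.name

    @property
    def children(self):
        child = self._cnode.children
        return NodeProxy(child) if child is not None else None

    @property
    def next(self):
        sibling = self._cnode.next
        return NodeProxy(sibling) if sibling is not None else None


# Build a tiny "C" tree: root has child a, and b is a's next sibling.
root, a, b = CNode("root"), CNode("a"), CNode("b")
root.children = a
a.next = b

node = NodeProxy(root)
node = node.children   # new proxy for a; the proxy for root is released
node = node.next       # new proxy for b; the proxy for a is released
print(node.name, NodeProxy.live)
```

   Because the proxies hold no links to each other, no cycles form
   between Python objects, and only the proxies currently in use occupy
   Python memory.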
       
   gdome -- Whoa. I thought the DOM support was in libxml. I'll have to
   look into gdome. Can someone enlighten me on the relationship between
   gdome and libxml DOM support? Does gdome support a newer version of
   the DOM spec? Should we build DOM support for Python on top of libxml
   or on top of gdome?
   
   Summary -- I'll continue to work on the SWIG wrappers for the libxml
   DOM interface. I'll try to fix a few problems that I've found and will
   look into generating support for encodings, catalogs, and entities.
   I'll also try to learn a bit more about gdome.
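
   On the 'free' issue noted in the evaluation above, one possible
   direction is a thin higher-level class that owns the document and
   frees it automatically. The following is only a sketch under
   assumptions: Document is a hypothetical name, and 'free_func' stands
   in for whatever wrapped function (e.g. xmlFreeDoc) releases the C
   document. It is not part of the generated wrappers.

```python
class Document:
    """Hypothetical owner of a low-level document handle."""

    def __init__(self, handle, free_func):
        self._handle = handle
        self._free = free_func  # stands in for a wrapped xmlFreeDoc

    def close(self):
        # Free the underlying document at most once.
        if self._handle is not None:
            self._free(self._handle)
            self._handle = None

    def __del__(self):
        # Called when the last reference goes away (promptly on CPython).
        self.close()


# Usage: record "freed" handles in a list instead of calling into C.
freed = []
doc = Document("doc-handle", freed.append)
del doc                 # last reference dropped; the handle is freed
print(freed)
```

   With something like this, a long running script would not need to
   call the 'free' method itself, although an explicit close() is still
   available.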
   
  Support for SAX
  
   I've built Python wrappers for the libxml SAX support by hand (i.e.
   not generated by SWIG). (Available at http://www.rexx.com/~dkuhlman.)
   A bit of evaluation:
   
     * Ease of use -- I've used it quite a bit and it seems quite easy
       to use. It's a trivial task to create a Python handler class with
       methods like 'startElement', 'endElement', 'characters', etc. and
       then do the parse to catch those events.
     * Efficiency -- The wrapper C code checks, at the beginning of the
       parse, to determine which event handler methods are defined in the
       handler class. Then, during the parse, the C code does not call
       any Python code (or do look-ups) for those event handlers that are
       _not_ defined. For example, if the method 'characters' is not
       defined in the handler class, then the C code will not call the
       Python code for the 'characters' event. So, processing should be
       quite efficient when a minimum of work is done in Python. Perhaps
       another way to say this is that there will be Python overhead
       only where that overhead is needed.
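
   The two points above can be sketched in pure Python. This is not the
   wrapper's C code: 'drive', EVENT_NAMES, and the list of raw events
   are hypothetical stand-ins for the C side of the parse. The sketch
   only shows handler methods being resolved once, before the parse,
   and events without a handler being skipped.

```python
class Handler:
    """Handler class that defines only some of the event methods."""

    def __init__(self):
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    # No 'characters' method, so character events never reach Python code.


EVENT_NAMES = ("startElement", "endElement", "characters")


def drive(handler, raw_events):
    """Deliver events to the handler; return the number of calls made."""
    # Resolve the handler methods once, at the beginning of the "parse".
    callbacks = {}
    for event_name in EVENT_NAMES:
        method = getattr(handler, event_name, None)
        if callable(method):
            callbacks[event_name] = method

    calls = 0
    for event_name, args in raw_events:
        callback = callbacks.get(event_name)
        if callback is None:
            continue  # in C this would be a cheap NULL check
        callback(*args)
        calls += 1
    return calls


# Stand-in for the events a parser would produce for <doc>hello</doc>.
raw = [
    ("startElement", ("doc", {})),
    ("characters", ("hello",)),
    ("endElement", ("doc",)),
]
handler = Handler()
calls = drive(handler, raw)
print(calls, handler.events)
```

   Only the start and end events result in Python calls here; the
   character data is dropped without any per-event method look-up.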
       
   Creating a parser driver for PyXML built on libxml seems like a very
   good idea. There are several benefits to be gained from doing so:
   
     * It would be fast, because most of the work would be done in C
       (libxml).
     * It would provide a validating parser.
     * It would be both fast and validating. This is something that PyXML
       (to my understanding) does not currently have. pyexpat and sgmlop
       are fast (because they are implemented in C), and xmlproc is a
       validating parser. But no current driver is both.
       
   Here are a couple of issues that we should keep in mind:
   
     * Building libxml is a reasonable amount of work, which not every
       user of PyXML is likely to want to do. Therefore, we will most
       likely want to package a libxml parser driver for PyXML as an
       add-on, i.e. as something that can be built and installed after
       PyXML has been installed.
     * For speed, it would be advantageous to not execute call-backs (or
       do call-back look-up) for event handler methods not defined in the
       handler class.
       
   Summary -- I'll start looking into and working on a parser driver (the
   equivalent of pyexpat or sgmlop) for PyXML built on top of libxml.
   
  Support for XSLT
  
   I've built Python wrappers for libxslt. (Available at
   http://www.rexx.com/~dkuhlman.) I've had one user report a bug, which
   I fixed. I've used it a reasonable amount. It's very easy to use.
   
  Additional Notes
  
   One of my goals when I started my work in exposing libxml and libxslt
   to Python was to provide an alternative source of XML support for
   Python. My belief is that it adds credibility to the Python project to
   have more than one source of support for something as important as
   XML. So, I feel that it is important both that we support the PyXML
   effort and that I provide independent support built on top of
   libxml/libxslt.


-- 
Dave Kuhlman
dkuhlman rexx com
http://www.rexx.com/~dkuhlman


