Re: [xml] Support for Python



On Thu, Jan 24, 2002 at 04:57:39PM -0800, Dave Kuhlman wrote:
I've been thinking about ways that libxml (and libxslt) can be used
to provide XML support for the Python programming language.

  Right, me too :-)

   We should consider providing support in the following areas:
   
     * Support for the DOM interface built on libxml (or gdome?).

   I would split it into:
     - Support for the tree interface built on libxml
     - Support for the DOM2 api built on gdome2

     * Support for the SAX interface built on libxml.
     * Support for XSLT built on libxslt. (Possibly a discussion for the
       libxslt list.)

  Yep, let's keep XSLT separate for a bit.

  Support for DOM
  
   I've built DOM support for Python by hand, i.e. manually written
   wrapper functions, types, etc. that expose libxml's DOM support.
   (Available at http://www.rexx.com/~dkuhlman.) But it's weak.
   
   I've also used SWIG to generate wrappers for the libxml DOM support.
   Basically, I generated wrappers for the stuff in include/libxml/tree.h
   and include/libxml/parser.h. It works "pretty" good. A bit of
   evaluation:

  I'm tempted to go through a similar autogeneration of stubs, like SWIG does,
but would prefer to generate the Python glue directly from the formal XML
description of the interface.

     * Because I used SWIG's shadow classes, the doc, nodes, and
       attributes look, from Python's point of view, like instances of
       classes. So, walking the DOM tree is very easy and natural.

  Sounds like a good idea; I need to look at the generated code. Another way is
to make minimal wrappers and build more object-oriented classes on top of
the raw functions, defining the classes at the Python level.
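
  A minimal sketch of that second approach, assuming a thin layer of flat
functions working on opaque node handles. The raw_* functions and the _NODES
table below are hypothetical stand-ins for a C extension, not libxml calls;
only the children/next navigation mirrors the tree interface discussed here.

```python
# Stand-in for a thin C extension: flat functions on opaque handles.
# Everything here is hypothetical; a real binding would call into libxml.
_NODES = {
    1: {"name": "doc", "children": 2, "next": None},
    2: {"name": "chapter", "children": None, "next": 3},
    3: {"name": "appendix", "children": None, "next": None},
}


def raw_node_name(handle):
    return _NODES[handle]["name"]


def raw_node_children(handle):
    return _NODES[handle]["children"]


def raw_node_next(handle):
    return _NODES[handle]["next"]


class Node:
    """Python-level class built on top of the raw functions."""

    def __init__(self, handle):
        # Only the opaque handle is stored; links between nodes stay in
        # the underlying structure, so no Python reference cycles arise.
        self._handle = handle

    @property
    def name(self):
        return raw_node_name(self._handle)

    @property
    def children(self):
        handle = raw_node_children(self._handle)
        return Node(handle) if handle is not None else None

    @property
    def next(self):
        handle = raw_node_next(self._handle)
        return Node(handle) if handle is not None else None


def child_names(node):
    """Walk the first-level children using children/next."""
    names = []
    child = node.children
    while child is not None:
        names.append(child.name)
        child = child.next
    return names
```

  Each access to children or next builds a fresh proxy on request, which is
the on-the-fly behaviour described further down in the quoted message.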
  
     * One benefit of doing this -- The Python objects (xmlDoc, xmlNode,
       xmlAttr) are proxies for the "real", underlying (libxml) C objects
       and the linkages between objects are in the underlying C objects.
       Therefore, this implementation does not suffer from the problems
       caused by circular references in Python objects. (Note that I

  Okay, a point to check in any solution.

     * Moreover, the Python objects are created and destroyed on the fly
       and only on request. For example,
            node = node.children
            node = node.next
       This code creates two nodes. Furthermore, when the value of
       variable 'node' is over-written (and if there is no other
       reference to that value), the Python object is destroyed. (For

  Okay, I expect this from any implementation.

     * One qualification is that the interface is at the level of the
       libxml, so it's a bit low level. For example, a long running
       application would have to call a 'free' method, e.g. xmlFreeDoc,
       which is not something a Python programmer would expect to have to
       do.

  Hum, keeping reference counting for xmlDocPtr is nearly impossible,
I doubt there is a workaround. Well, this needs more thinking; the idea
of having to call a doc.free() at the end of the processing doesn't sound
that bad.
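
  A sketch of what that explicit call could look like from the Python side,
assuming a hypothetical Document wrapper that owns the underlying tree. The
release is only simulated with a flag here; a real binding would call
xmlFreeDoc at that point.

```python
class Document:
    """Hypothetical wrapper owning an underlying C document."""

    def __init__(self, name):
        self.name = name
        self._freed = False  # stands in for the state of the C-level tree

    def free(self):
        # A real binding would call xmlFreeDoc here. Guarding with the
        # flag makes a second call harmless instead of a double free.
        if not self._freed:
            self._freed = True

    @property
    def freed(self):
        return self._freed


def process(doc):
    """Do the work, then release the tree even if processing fails."""
    try:
        return doc.name.upper()
    finally:
        doc.free()
```

  Putting the call in a finally clause keeps the "free at the end of the
processing" rule in one place, which matters most for the long running
applications mentioned above.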

     * Another qualification is that this implementation needs some
       fix-up, because there are some kinds of nodes in the tree that can
       cause segmentation faults.

  A Python wrapper class sounds better for dealing with a unified abstraction
of all the kinds of nodes.

     * And, the generated code is a bit large. I'm not sure that this is
       a concern in a world where disk space is so cheap. It's possible

  I don't really care. Only developers will have the generated stubs somewhere;
only the size of the shared library object _libxmlmodule.so would really be
important.

   gdome -- Whoa. I thought the DOM support was in libxml. I'll have to

  No, it's DOM-like but not DOM.

   look into gdome. Can someone enlighten me on the relationship between
   gdome and libxml DOM support? Does gdome support a newer version of
   the DOM spec? Should we build DOM support for Python on top of libxml
   or on top of gdome?

  Yes, but first implement tree support at the libxml level.

   Summary -- I'll continue to work on the SWIG wrappers for the libxml
   DOM interface. I'll try to fix a few problems that I've found and will
   look into generating support for encodings, catalogs, and entities.

  I'm not 100% sure that I want to go the SWIG way. I will look first at
the way the GTK python wrappers have been done and work from there.

  Support for SAX
     * Ease of use -- I've used it quite a bit and it seems quite easy to
       use and usable. It's a trivial task to create a Python handler
       class with methods like 'startElement', 'endElement', 'characters',
       etc and then do the parse to catch those events.

  SAX support should be close to trivial. I expect something similar
to your current interface, but also allowing compatible use with
the xmllib and sgmlop interfaces, where the callbacks are just:
    - close
    - getmethodname
    - data
    - start
    - end
    - handle_entityref

   It should be trivial to handle both, and that would allow very easy
migration of existing code.
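
   A rough sketch of handling both styles, using the xml.sax module from the
Python standard library as a stand-in parser (not a libxml binding). The
Recorder and TargetAdapter classes are hypothetical names, and only the
start, end, data and close callbacks from the list above are covered.

```python
import xml.sax


class Recorder:
    """sgmlop/xmllib-style target with start, end, data, close callbacks."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        self.events.append(("data", text))

    def close(self):
        self.events.append(("close",))


class TargetAdapter(xml.sax.ContentHandler):
    """Forward SAX-style events to a target using the shorter names."""

    def __init__(self, target):
        super().__init__()
        self.target = target

    def startElement(self, name, attrs):
        # Copy the attributes into a plain dict for the target.
        self.target.start(name, dict(attrs.items()))

    def endElement(self, name):
        self.target.end(name)

    def characters(self, content):
        self.target.data(content)

    def endDocument(self):
        self.target.close()


def parse_events(xml_bytes):
    """Parse a document and return the events seen by the target."""
    target = Recorder()
    xml.sax.parseString(xml_bytes, TargetAdapter(target))
    return target.events
```

   Code written against startElement/endElement/characters can subclass the
handler directly, while code written against start/end/data goes through the
adapter, so neither side has to change.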

   Creating a parser driver for PyXML built on libxml seems like a very
   good idea. There are several benefits to be gained from doing so:

   Since I don't know PyXML, I will abstain from commenting on this ATM.
It seems this should be trivially implementable with just glue Python code
on top of the SAX interface.

  Additional Notes

   Isn't the Python2 internationalization layer based on UTF16 strings?
libxml/libxslt use UTF8. Is there any gain in trying to follow the
Python2 conventions? Is there any risk in staying with UTF8 strings
seen as usual Python strings?

   Converting to/from UTF16 all the time would be a killer.
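
   To make the trade-off concrete, here is a small sketch of the two options
at the binding boundary. The helper names are hypothetical, and it says
nothing about how a given Python version stores strings internally; it only
shows that one path passes the UTF8 bytes through and the other transcodes
on every crossing.

```python
def passthrough(utf8_bytes):
    """Hand the UTF8 bytes to Python unchanged: no conversion at all."""
    return utf8_bytes


def via_utf16(utf8_bytes):
    """Transcode UTF8 to UTF16 on the way in, as a UTF16 layer would need."""
    return utf8_bytes.decode("utf-8").encode("utf-16-le")


def back_to_utf8(utf16_bytes):
    """Transcode again on the way back to the library."""
    return utf16_bytes.decode("utf-16-le").encode("utf-8")


sample = "libxml é".encode("utf-8")
round_trip = back_to_utf8(via_utf16(sample))
```

   The round trip is lossless, but each call into or out of the library pays
for a decode and an encode, which is the cost the remark above is about.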

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
