[xml] Python documentation - any help welcome!



Hi,

After struggling to get to grips with Libxml2 and Python, I figured that although I can't contribute much in 
the way of code, I can have a crack at getting some useful documentation up together.

I have put the first part up on my Wiki, if anyone would care to review for accuracy - or help out where it 
is a bit light on examples?

http://mikekneller.com/wiki/index.php?title=Getting_started_with_Libxml2_and_Python_-_part_1

I realise that this is probably a bit n00b for most here, but I would like to bring together workable 
examples from the ground up. Most of the other information I have read assumes a level of knowledge I just 
didn't have when I encountered the library for the first time.

For reference, I'll post the text here.

Cheers
Mike

=== Getting started with Libxml2 and Python - Part 1 ===

Overview

Getting to grips with Libxml2 and Python can be a frustrating experience, 
particularly as in-depth, accurate Python documentation is hard to find 
on the Web.

Many Python developers dislike the Libxml2 bindings, as they are 'un-Pythonic'
and much too C-like. This however misses the point of Libxml2. The point is that
this library is portable, mature, extremely full-featured and *very* fast.

In the process of writing this tutorial, I hung out in the #xml channel on 
irc.gnome.org, and subscribed to the xml gnome org mailing list - I 
was given a lot of help when things weren't obvious! Although there's not a massive 
amount of activity on IRC, or in the mailing list on a daily basis, I would
definitely recommend spending some time browsing the archive - or using Google
to search it when you have questions. Additionally, I have found the people in 
the Libxml2 community very helpful. 

Manipulating XML using Libxml2 is fairly straightforward once you have a couple
of working examples. That is the problem in Python, however: finding 
working examples is a bit of a hit-and-miss affair.

The first place to look is in the examples folder in the documentation installed
with your release (/usr/share/doc/libxml2-python-2.6.27/examples on my machine).

TODO: where are the examples on a number of distributions/platforms?

Also, take a moment to scan through libxml2.py itself - this is the Python wrapper and
is a good place to look if you are hunting for a particular function. There
is plenty of information in the wrapper as all the docstrings have been 
populated, so you can always get information like

        print libxml2.parseFile.__doc__
        
for any particular function.

Also remember that you can list the available methods for any Python object by 
using the dir function. The most immediately useful objects are xmlCore, xmlNode
and xmlDoc, so
        dir(libxml2.xmlCore)
is your friend when working out what functions are available to you.

I'm going to assume that you know a bit about XML, at least enough to recognise
an XML document when you see one, and hopefully enough about Python to know 
where to find the documentation!

[Installing Libxml2]

TODO: installation examples for a number of distros/platforms.

[Loading a document]

The first thing you want to do in XML will be to load a document of some sort.
As a new Libxml2 user, this is where our confusion starts! It is worth remembering
that in general, the Python bindings are automatically generated - therefore
there is an equivalent Python function for every C function, and sometimes this
can lead to unnecessary, or apparently duplicated Python functions.

The library contains a number of different functions we can use to load an XML 
document:

        parseDoc, parseFile, parseMemory, readDoc, readFd, readFile, readMemory,
        recoverDoc and recoverFile

All of these functions return an xmlDoc object. Examples for using each of these
follow:

        parseDoc(cur) - load an XML document from memory (a string)

        doc = libxml2.parseDoc("""<?xml version="1.0"?>
        <root>Hello world!</root>""")   
        

        parseMemory(buffer, size) - load an XML document from memory
        
        doc = libxml2.parseMemory(xml, len(xml))
        
This function performs exactly the same job as parseDoc from a Python perspective.
        

        parseFile(filename) - load an XML document from a file
        
        doc = libxml2.parseFile('test.xml')

        
        readDoc(cur, URL, encoding, options) - load an XML document from memory (a string)
        
This version of the function allows you to specify options on a per-document
basis. The parseDoc version uses the parser defaults (in practice, the 
parser global settings, which can also be modified using global functions).
        
        In most cases,
                doc = libxml2.readDoc('<foo/>',None,None,0)
        will be equivalent to
                doc = libxml2.parseDoc('<foo/>')

When using XSL, I have found it better to force entities
to be resolved before running the transform, in which case it is useful to
use the following (note that the URL and encoding arguments are both passed
as None):
        
        doc = libxml2.readDoc(xml, None, None, libxml2.XML_PARSE_NOENT)
        

        readFd(fd, URL, encoding, options) - load an XML document from a file descriptor
        
        readFile(filename, encoding, options) - load an XML document from a file allowing
        the specification of per-document options.

        
        readMemory(buffer, size, URL, encoding, options) - for Python, equivalent to
        using readDoc


        recoverDoc(cur) - this is equivalent to parseDoc, except that even broken XML
        will result in a usable XML tree being created.
        
        doc = libxml2.recoverDoc('<foo><broken></foo>')
        
will report a parser error, but the call still returns a document, and doc will
contain:
        <?xml version="1.0"?>
        <foo><broken/></foo>


        recoverFile(filename) - same as recoverDoc, but for files.


In the simplest case, to load a file from disk you can do:

        doc = libxml2.parseFile( 'test.xml' )

[Managing your memory]

Ugh, nasty memory management. Isn't that why we're using Python, to avoid all that
stuff?

Libxml2 does not explicitly handle the cleaning up of the memory it uses, so when 
you finish working with your xmlDoc object, you need to remember to call freeDoc.

OK, so what we have now is something like the following:

        doc = libxml2.parseFile( 'test.xml' )
        # Do some stuff with the document here!
        doc.freeDoc()

It doesn't matter which method you use to create your xmlDoc object - each of the
functions returns the same thing, so just remember to call freeDoc on it when you
are done and all will be well.

There, that wasn't so hard was it? :-)


[Working with the document]

Now that we have a working document, and know how to dispose of it when we're done,
it is time to look at a number of common XML operations and see how we can do
those using Libxml2 and Python.

[Elements]

The xmlDoc object has a large number of methods. As well as its own collection, 
it inherits from xmlNode, which inherits from xmlCore; this gives you over 200
available methods to read up on! This is fairly daunting when you can't find an
example that shows you how to perform simple tasks, but don't worry. In practice
we can get by in most situations with a small fraction of these.

All well-formed XML documents contain a single root element, which contains all the
other elements.

You can get a reference to the root element using getRootElement on the document
object. The root element is an xmlNode object, just like all other nodes in the 
document. Working with nodes is fairly straightforward:

        >>> import libxml2
        >>> doc = libxml2.parseDoc( '<foo>Hello world.</foo>' )
        >>> root = doc.getRootElement()
        >>> print root.name
        foo
        >>> print root.content
        Hello world.
        >>> root.setProp('bar', 'an attribute')
        <xmlAttr (bar) object at 0x13c00d0>
        >>> print root.serialize()
        <foo bar="an attribute">Hello world.</foo>
        >>> doc.freeDoc()

The serialize method can be called on a single node, or on the document, and 
provides a string representation of that node or document.

Navigating through the document is not much more difficult - we can use the node
properties (from the xmlCore ancestor object) to find the child nodes:

        child = root.children
        # the children property returns the FIRST child of a node
        while child is not None:
                if child.type == "element":
                        # do something with the child node
                        print child.name
                # advance on every pass, not only for elements
                child = child.next

Accessing the attributes of a node is possible in a similar way:

        import libxml2
        doc = libxml2.parseDoc('<foo att1="value 1" att2="value 2"/>')
        root = doc.getRootElement()
        for property in root.properties:
                if property.type=='attribute':
                        # do something with the attributes
                        print property.name
                        print property.content
        doc.freeDoc()

Notice that both loops test the type of the node. This is because in most
documents additional nodes show up (typically whitespace text nodes between
elements) as well as the specific node types we are interested in.

[XPath]

Navigating a document in this manner is straightforward, but tedious and requires
accessing every node in the document until you get to the specific one you need.
More often, you want to retrieve a set of nodes or a single node matching some
specific criteria. This is where XPath comes in, and Libxml2 has full support
for XPath.

XPath queries can be run against the document or a specific element in the 
document, but in either case the procedure is the same.

The xmlsoft.org Python page suggests the following:

        doc = libxml2.parseFile("test.xml")
        ctxt = doc.xpathNewContext()
        result = ctxt.xpathEval("//*")
        # do something with the result
        
        doc.freeDoc()
        ctxt.xpathFreeContext()

which involves creating an XPath context, running a query against it and then
freeing the context when finished. If you have a lot of queries to run, then
this is the best way to work, as the context can be re-used for each query.

In practice, the xmlCore object provides a helper function which wraps this up 
for you. For single queries running xpathEval directly on the node will suffice, 
just be aware that each query creates and destroys its own context, which is 
going to be slower than the above implementation.

An XPath query that selects nodes will return a list of nodes. This makes it easy
to perform an operation on many nodes at once.

        import libxml2
        doc = libxml2.parseFile('test.xml')
        # select every element in the document
        result = doc.xpathEval('//*')
        for node in result:
                print node.name
        doc.freeDoc()

Apart from the call to freeDoc, I can't see how much more Pythonic it could be?

[Writing to a file]

To write the contents of your XML document to a file, just use the saveTo method:

        f = open('output.xml','w')
        doc.saveTo(f)
        f.close()

The saveTo method is also part of xmlCore, so you can use it to save the contents
of just a single node and its children as well as the whole document.

[Modifying documents]

To add a new node to a document, first we must create the node and then add it 
as a child of the element it belongs to.

        import libxml2
        doc = libxml2.parseDoc('<foo/>')
        root = doc.getRootElement()
        newNode = libxml2.newNode('bar')
        root.addChild(newNode)

At this stage, our document contains

        <?xml version="1.0"?>
        <foo><bar/></foo>

We can set the content of newNode using the setContent method:
        
        newNode.setContent('Hello')

We can append some content to our <bar/> element by calling addContent,

        newNode.addContent(' world')
        
which gives us

        <?xml version="1.0"?>
        <foo><bar>Hello world</bar></foo>

Creating or setting an attribute is easy too; we use the setProp method.

        newNode.setProp('attribute', 'the value')

If the attribute doesn't exist, it will be created; otherwise it will just have
its content changed.

Adding nodes at a particular location in the hierarchy is possible using 
addNextSibling, or addPrevSibling. These operate in the same way as addChild, 
except they operate on the node you wish to add next to, rather than to the
parent.

        sibling = libxml2.newNode('bar2') 
        newNode.addPrevSibling(sibling)

gives

        <?xml version="1.0"?>
        <foo><bar2/><bar attribute="the value">Hello world</bar></foo>

whereas

        sibling = libxml2.newNode('bar2') 
        newNode.addNextSibling(sibling)

gives

        <?xml version="1.0"?>
        <foo><bar attribute="the value">Hello world</bar><bar2/></foo>

To insert text into the document, you create a text node with some content and
add it in the same way. Continuing from the addPrevSibling version:

        text = libxml2.newText('some text\n')
        newNode.addNextSibling(text)
        
which leaves us with

        <?xml version="1.0"?>
        <foo><bar2/><bar attribute="the value">Hello world</bar>some text
        </foo>

To create content and nodes, the useful Libxml2 helper functions are newComment,
newText and newNode. You can also create a new node by copying one that already 
exists. The xmlNode object has copyNode and copyProp methods which can be useful
here.

To add these new nodes into a document, you need to use one of the following
methods (directly on nodes rather than on the document): addChild, addContent,
addNextSibling, addPrevSibling.

[XSLT]

Libxml2 has a companion library called libxslt which provides support for
XSL Transformations. I find the following example provides most of the 
useful information for a Python coder:

        import libxml2
        import libxslt

        def runTransform(xmlFile, xslFile):
                sourcedoc = libxml2.parseFile(xmlFile)
                styledoc = libxml2.parseFile(xslFile)
                # the stylesheet object takes over styledoc
                style = libxslt.parseStylesheetDoc(styledoc)
                result = style.applyStylesheet(sourcedoc, None)
                out = style.saveResultToString(result)
                # freeStylesheet also frees styledoc
                style.freeStylesheet()
                result.freeDoc()
                sourcedoc.freeDoc()
                return out

Notice that there are three documents involved, each of which needs to be 
explicitly freed: the source, the stylesheet and the result. The stylesheet
document is freed by freeStylesheet rather than by its own freeDoc call. The
starting point for documentation can be found here: http://xmlsoft.org/XSLT/python.html.

[Libxml2 and HTML]

If you have spent any time poking around libxml2.py, you will probably have 
noticed a number of functions that start with html. This is because Libxml2 has
an HTML parser built in that does a pretty good job of loading real world 
(in other words horribly broken) HTML documents. You can then use the features
we have previously discussed to read or modify the HTML.

The following example will load pretty much any HTML file into an xmlDoc object

        parse_options = libxml2.HTML_PARSE_RECOVER + \
                libxml2.HTML_PARSE_NOERROR + \
                libxml2.HTML_PARSE_NOWARNING
        doc = libxml2.htmlReadDoc(html, '', None, parse_options)

Here is a more complete example, which extracts all the links from the Guardian
newspaper Website home page and prints the href attribute.

        import urllib2
        import libxml2

        # Load the page into a string
        f = urllib2.urlopen('http://www.guardian.co.uk')
        html = f.read()
        f.close()

        parse_options = libxml2.HTML_PARSE_RECOVER + \
                libxml2.HTML_PARSE_NOERROR + \
                libxml2.HTML_PARSE_NOWARNING
        doc = libxml2.htmlReadDoc(html,'',None,parse_options)
        links = doc.xpathEval('//a')
        for link in links:
                href = link.xpathEval('attribute::href')
                if len(href) > 0:
                        href = href[0].content  
                        print href
        doc.freeDoc()




