[xml] Fwd: issue with libxml2, python and entities



Heh, sorry about that. 

My testcase was not as I mentioned. However, the result is the same with the following test.xml:

<?xml version="1.0"?>
<!DOCTYPE content [
<!ENTITY copy "&#169;">
]>

<content>
<p>&copy;2007 Mike Kneller</p>
</content>

In other words, it does not use external entity references.
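
For comparison, here is a minimal sketch of the same check done with Python's standard-library xml.etree.ElementTree. This is an assumption for illustration only: it is not libxml2 (it uses expat underneath), and the variable names are made up. It shows the kind of result expected from the document above, where the entity declared in the internal subset is expanded in the element's text.

```python
import xml.etree.ElementTree as ET

# The test document from this message: the "copy" entity is declared
# in the internal subset, so no external file has to be fetched.
TEST_XML = """<?xml version="1.0"?>
<!DOCTYPE content [
<!ENTITY copy "&#169;">
]>

<content>
<p>&copy;2007 Mike Kneller</p>
</content>
"""

root = ET.fromstring(TEST_XML)

# Text of the <p> element, with the internal entity already expanded.
p_text = root.find("p").text
print(repr(p_text))
```

The printed text starts with the copyright sign (U+00A9) followed by "2007 Mike Kneller".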

I didn't realise how old the version of libxml on my system was. Anyway, I've updated to libxml2-2.6.27 and, hey presto, the problem disappears. Thanks for your comments; it was only the mention of "ancient version" that got me looking!

Cheers
Mike

Begin forwarded message:

From: Daniel Veillard <veillard redhat com>
Date: 27 January 2007 11:00:18 GMT
To: Mike Kneller <ukchill mac com>
Subject: Re: [xml] issue with libxml2, python and entities

On Sat, Jan 27, 2007 at 12:34:21AM +0000, Mike Kneller wrote:
I am not sure if I have located a bug or not....

Using Python (2.4) and libxml2 2.6.22

When I load a document containing an entity, if I attempt to read
the value of a node containing an entity, I get the text content and
the entity disappears.
In the following example, when looking at root.content I would expect  
to see '&#169;2007', instead all I get is '2007'.

I was advised on the #XML IRC channel to construct a simple test  
case, so here it is:


  On IRC you said the entity was defined in the internal subset; it's not.

File 1: test.xml

<?xml version="1.0"?>
<!DOCTYPE content [
<!ENTITY % HTMLlat1 PUBLIC
    "-//W3C//ENTITIES Latin 1 for XHTML//EN"
%HTMLlat1;
]>
<content>
<p>&copy;2007</p>
</content>

  libxml2 doesn't load the external subset by default,

File 2: testcase.py

import libxml2

# Parse with the default options and look at the root element.
sourcedoc = libxml2.parseFile('test.xml')
root = sourcedoc.getRootElement()
print(root.serialize())
print(root.content)

  So your content element has 2 children: an entity reference
to copy, whose content is unknown, and the text node with "2007".

Reading the source for libxml2.py, I find the following:
     def getContent(self):
         """Read the value of a node, this can be either the text
            carried directly by this node if it's a TEXT node or the
            aggregate string of the values carried by this node
            child's (TEXT and ENTITY_REF). Entity references are
            substituted. """
         ret = libxml2mod.xmlNodeGetContent(self._o)
         return ret


Which in my (admittedly limited) understanding I would have thought  
would return the translated entity as well as the text when I examine  
root.content.

Is this a bug, or am I doing something wrong?

  You are not asking to load the external subset; use readFile and pass the
XML_PARSE_DTDLOAD option. It should work even with the ancient version
2.6.22, but please upgrade first in case of problem.
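
A rough sketch of that suggestion, assuming libxml2's Python bindings are installed. The helper name content_with_dtd is hypothetical, and loading the external entity set may additionally require network access or an XML catalog:

```python
def content_with_dtd(path):
    """Return the root element's content after parsing `path` with
    external DTD loading enabled (hypothetical helper, not part of libxml2)."""
    import libxml2  # libxml2's Python bindings, imported lazily

    # readFile(filename, encoding, options); passing None for the encoding
    # leaves detection to the parser.
    doc = libxml2.readFile(path, None, libxml2.XML_PARSE_DTDLOAD)
    try:
        return doc.getRootElement().content
    finally:
        # Documents created by the bindings must be freed explicitly.
        doc.freeDoc()
```

It would be called as content_with_dtd('test.xml'). Whether the entity then shows up in the content depends on the external entity file actually being retrievable.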

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/


