Re: [xml] redicting parts of trees



Kasimier Buchcik wrote:
Hi,

On Thu, 2005-05-19 at 17:16 +0200, Martijn Faassen wrote:

Kasimier Buchcik wrote:
[snip stuff that goes over his head without a lot of further study]

This is just a cheerleading note; I'm really glad you guys are taking this up, as I can already see there are many subtle issues involved I would not have understood without significant study. Thanks!

Anyway, anything I can do now to help? I will of course be testing this facility at some stage within lxml, and give feedback then if necessary.


You could describe how you intend to manage namespaces in your
wrapper. Will you go the W3C way or the Libxml2 namespace way?

I'm following the ElementTree way, which uses Clark notation. I.e. the wrapper exposes namespace URIs directly as part of element and attribute names, like this:

{http://namespaces.somewhere.org/ns1}foo

and prefixes are, for now, completely ignored as not relevant to the XML infoset.
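
To make the notation concrete, here is a minimal sketch using the xml.etree.ElementTree module from the Python standard library (not lxml itself). The names doc_a, doc_b and NS1 are made up for the example. It only shows that two documents binding the same URI to different prefixes end up with the same tag, which is why prefixes can be ignored at this level:

```python
import xml.etree.ElementTree as ET

NS1 = "http://namespaces.somewhere.org/ns1"

# Two documents that bind the same namespace URI to different prefixes.
doc_a = ET.fromstring('<a:foo xmlns:a="%s"/>' % NS1)
doc_b = ET.fromstring('<b:foo xmlns:b="%s"/>' % NS1)

# In Clark notation the URI is part of the name; the prefix is gone.
clark_name = "{%s}foo" % NS1
same_tag = doc_a.tag == doc_b.tag == clark_name
```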

Both have pros and cons. The relevant drawback of the Libxml2 way
is that it's hard, if not impossible, to implement a DOM wrapper
in a programming language where the time of destruction
of an object is not under the programmer's control.

Thanks, this is interesting as this is exactly what I'm trying to do with lxml.

Let me try to give some background information - possibly too
detailed. I hope to be corrected if something's wrong:

Libxml2 handles the corresponding DOM Node methods namespaceURI() and
prefix() in the following way:
node->ns->prefix == result of node.prefix()
node->ns->href   == result of node.namespaceURI()

The node->ns field is a pointer to an xmlNs struct, which
is held in the elem->nsDef field of element-nodes.

Right, I've been using this structure in the lxml implementation.
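
As an illustration of the two DOM accessors mentioned above, here is a small sketch with xml.dom.minidom from the Python standard library. minidom is a separate DOM implementation and does not use libxml2, so the comments relating its values to node->ns and elem->nsDef are only an analogy, not a description of libxml2's API:

```python
from xml.dom import minidom

doc = minidom.parseString('<foo:item xmlns:foo="urn:test:foo"/>')
elem = doc.documentElement

# What a W3C-style DOM reports for the element.
prefix = elem.prefix                # analogous to node->ns->prefix
namespace_uri = elem.namespaceURI   # analogous to node->ns->href

# The declaration itself shows up as an ordinary attribute node,
# which is the role an elem->nsDef entry plays in libxml2.
declaration = elem.getAttribute("xmlns:foo")
```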

Such node->nsDef entries correspond to namespace declaration
attributes in DOM (e.g. xmlns:foo="urn:test:foo"). Libxml2's
way requires a node->nsDef entry, and thus a namespace declaration
attribute, to be present on the node itself or on an ancestor
node, which exactly mirrors the serialized form (the XML as
written to a file).

This creates the following problem:
If you remove an attribute-node that is bound to a namespace
from its parent, the attr->ns field still points to an elem->nsDef
entry. This is OK as long as that element-node is not itself
freed, which would free its elem->nsDef entries as well. The
destruction of this element would leave attr->ns pointing to freed
memory.

Ugh. Luckily the ElementTree API doesn't allow the detaching of attribute nodes from an element, but I can see how this would hurt any W3C DOM implementation.

But now I wonder: does this only apply to attribute nodes, or also to element nodes which are in a subtree? Testing this... Ugh, yes, it does. When I move a namespaced element (where the namespace is defined higher in the tree) into another tree, and then subsequently remove the original tree, things go way wrong and valgrind indeed points to a reference to a libxml2 namespace structure that has since been freed. Not good...

But thanks for pointing this issue out to me!
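
For reference, the element scenario described above can be written down with the pure-Python xml.etree.ElementTree module, where it is harmless because the namespace URI lives in the tag itself. This is a sketch of the behaviour an ElementTree-compatible wrapper is expected to preserve, not lxml code; source, target and NS are names invented for the example:

```python
import xml.etree.ElementTree as ET

NS = "urn:test:foo"

# The namespace is declared on the root, above the element we will move.
source = ET.fromstring('<root xmlns:foo="%s"><foo:child/></root>' % NS)
target = ET.fromstring("<other/>")

# Move the namespaced element into the other tree.
child = source[0]
source.remove(child)
target.append(child)

# Drop the original tree; the moved element must stay fully usable.
del source

moved_tag = target[0].tag
serialized = ET.tostring(target, encoding="unicode")
```

On serialization the module has to invent a prefix for the moved element, since the original one was never stored.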

There's no automatic mechanism to avoid this, since there is
no reference counting involved. In C this is under the user's
control: you just have to know what you are freeing and when.
That is not the case in other programming languages like Python,
Delphi, Java, etc., where the destruction time of objects is not
always - if ever - predictable.

Indeed. Python tends to be fairly predictable if its refcounting algorithm is used, but that doesn't help any here, and that isn't constant across Python implementations anyway.

Safe removal of nodes:
So we obviously need a mechanism to make the node->ns reference
point to an xmlNs entry which is not in danger of being freed unpredictably.
A possible location would be a list of xmlNs entries, internally
managed by the DOM document wrapper.

Yes, in this case the problem would devolve to the issue I already have with dictionaries, which is manageable as I can make this stuff globally shared. Though, just as with dictionaries, I hope that the adoptNode() functionality could take care of this as well.
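
The document-managed list of namespace entries proposed above might look roughly like the toy model below. It is plain Python with no libxml2 involved, and NsEntry, DocumentWrapper, NodeWrapper and intern_ns are all hypothetical names. The only point is the ownership: entries belong to the document wrapper, so discarding an element cannot invalidate what another node refers to:

```python
class NsEntry:
    """Stand-in for an xmlNs struct: a (prefix, href) pair."""

    def __init__(self, prefix, href):
        self.prefix = prefix
        self.href = href


class DocumentWrapper:
    """Hypothetical wrapper that owns every namespace entry its nodes use."""

    def __init__(self):
        self._ns_entries = []  # owned by the document, not by any element

    def intern_ns(self, prefix, href):
        # Reuse an existing entry so each (prefix, href) pair is stored once.
        for entry in self._ns_entries:
            if entry.prefix == prefix and entry.href == href:
                return entry
        entry = NsEntry(prefix, href)
        self._ns_entries.append(entry)
        return entry


class NodeWrapper:
    """Hypothetical node whose ns attribute plays the role of node->ns."""

    def __init__(self, doc, name, prefix, href):
        self.name = name
        self.ns = doc.intern_ns(prefix, href)


doc = DocumentWrapper()
elem = NodeWrapper(doc, "item", "foo", "urn:test:foo")
attr = NodeWrapper(doc, "kind", "foo", "urn:test:foo")

# Discarding the element does not touch the document-owned entry.
del elem
```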

I suspect that adoptNode() recreating namespaces wherever necessary in the new document would indeed be sufficient to support Clark notation in ElementTree, even though the XML serialization would look ugly. Am I correct that adoptNode() would take care of this issue if prefixes are hidden from the API user's view?

Regards,

Martijn


