Re: [xml] redicting parts of trees


On Thu, 2005-05-19 at 17:16 +0200, Martijn Faassen wrote:
Kasimier Buchcik wrote:
[snip stuff that goes over his head without a lot of further study]

This is just a cheerleading note; I'm really glad you guys are taking 
this up, as I can already see there are many subtle issues involved I 
would not have understood without significant study. Thanks!

Anyway, anything I can do now to help? I will of course be testing this 
facility at some stage within lxml, and give feedback then if necessary.

You could describe how you intend to manage namespaces in your
wrapper. Will you go the W3C way or the Libxml2 namespace way?
Both have pros and cons. The relevant drawback of the Libxml2 way
is that it's hard, if not impossible, to implement a DOM wrapper
in a programming language where the time of destruction of an
object is not under the programmer's control.

Let me try to give some background information - possibly too
detailed. I hope to be corrected if something's wrong:

Libxml2 handles the corresponding DOM Node methods namespaceURI() and
prefix() in the following way: 

node->ns->prefix == result of node.prefix()
node->ns->href   == result of node.namespaceURI()

The node->ns field is a pointer to an xmlNs struct, which
is held in the elem->nsDef field of an element node.
Such elem->nsDef entries correspond to namespace declaration
attributes in DOM (e.g. xmlns:foo="urn:test:foo"). Libxml2's
way demands that an nsDef entry, and thus a namespace declaration
attribute, be present on the node itself or on an ancestor node;
this exactly mirrors the serialized (written as an XML file) form.
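
To make the relationship between node->ns and elem->nsDef concrete, here
is a toy model in Python. It is only a sketch: the classes Ns and Element
and the helper find_declaration() are hypothetical stand-ins, not Libxml2
or lxml API, and they carry only the fields discussed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Ns:
    """Toy stand-in for an xmlNs struct: one namespace declaration."""
    prefix: Optional[str]
    href: str


@dataclass(eq=False)
class Element:
    """Toy stand-in for an element node (only the fields discussed here)."""
    name: str
    parent: Optional["Element"] = None
    ns: Optional[Ns] = None                         # like node->ns
    ns_def: List[Ns] = field(default_factory=list)  # like elem->nsDef


def find_declaration(node: Element, ns: Ns) -> Optional[Element]:
    """Return the element (node or an ancestor) whose ns_def holds `ns`."""
    cur = node
    while cur is not None:
        if any(entry is ns for entry in cur.ns_def):
            return cur
        cur = cur.parent
    return None


# Models: <foo:root xmlns:foo="urn:test:foo"><foo:child/></foo:root>
foo = Ns("foo", "urn:test:foo")
root = Element("root", ns=foo, ns_def=[foo])
child = Element("child", parent=root, ns=foo)

# prefix() and namespaceURI() are read through the shared declaration.
print(child.ns.prefix, child.ns.href)
# The declaration that child.ns points to lives on an ancestor.
print(find_declaration(child, foo).name)
```

The point of the model is that child.ns is not a copy: it is the very
same entry that sits in root.ns_def, so its lifetime is tied to root.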

This circumstance creates the following problem:
If you remove an attribute node that is bound to a namespace
from its parent, the attr->ns field still points to an elem->nsDef
entry. This is OK as long as that element node is not itself
freed - which would free the elem->nsDef entries as well. The
destruction of this element would leave attr->ns pointing to freed
memory. There's no automatic mechanism to avoid this, since there is
no reference counting involved. In C this is under the user's
control: you just have to know what you are freeing and when.
Not so in other programming languages like Python, Delphi,
Java, etc., where the destruction time of objects is not always - if
ever - predictable.
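
The sequence of events can be sketched with a small Python model. Python
itself cannot produce a dangling pointer, so a `freed` flag stands in for
released memory; all class and function names here are hypothetical and
only mimic the fields named above.

```python
class Ns:
    """Toy stand-in for xmlNs; `freed` models released memory."""
    def __init__(self, prefix, href):
        self.prefix = prefix
        self.href = href
        self.freed = False


class Attr:
    def __init__(self, name, ns):
        self.name = name
        self.ns = ns          # like attr->ns
        self.parent = None


class Element:
    def __init__(self, name):
        self.name = name
        self.ns_def = []      # like elem->nsDef
        self.attrs = []

    def add_attr(self, attr):
        attr.parent = self
        self.attrs.append(attr)


def unlink_attr(attr):
    """Detach the attribute; attr.ns is deliberately left untouched."""
    attr.parent.attrs.remove(attr)
    attr.parent = None


def free_element(elem):
    """Model freeing an element: its declarations are released with it."""
    for ns in elem.ns_def:
        ns.freed = True
    elem.ns_def = []
    elem.attrs = []


# Models: <e xmlns:foo="urn:test:foo" foo:a="..."/>
foo = Ns("foo", "urn:test:foo")
e = Element("e")
e.ns_def.append(foo)
a = Attr("a", foo)
e.add_attr(a)

unlink_attr(a)     # the wrapper still holds on to the attribute...
free_element(e)    # ...while the element is destroyed at some later time

# In C, reading a->ns at this point would touch freed memory.
print(a.ns.freed)
```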

Safe removal of nodes:
So we obviously need a mechanism to make the node->ns reference point
to an xmlNs entry which is not in danger of being freed unpredictably.
A possible location would be a list of xmlNs entries, internally
managed by the DOM document wrapper. Another would be to use the
"oldNs" field of xmlDoc, or even to add a new field to xmlDoc for such
entries.
In Libxml2 this can currently be worked around by reconciling
the node, which re-creates such "stale" declarations on the node.
This is quite impractical since it could end up creating a vast
number of redundant ns-declarations. Additionally it does not work
if an attribute is removed.
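
A minimal sketch of the first option (a list of entries managed by the
document wrapper), again as a Python toy model. Document, stored_ns,
stable_ns() and detach() are made-up names for illustration, not an
existing Libxml2 facility; the `freed` flag models released memory.

```python
class Ns:
    def __init__(self, prefix, href):
        self.prefix = prefix
        self.href = href
        self.freed = False    # models "memory has been released"


class Node:
    def __init__(self, name, ns=None):
        self.name = name
        self.ns = ns          # like node->ns
        self.ns_def = []      # like elem->nsDef


class Document:
    """Hypothetical wrapper-side document owning a private ns store."""
    def __init__(self):
        # (prefix, href) -> Ns; entries live as long as the document.
        self.stored_ns = {}

    def stable_ns(self, ns):
        key = (ns.prefix, ns.href)
        if key not in self.stored_ns:
            self.stored_ns[key] = Ns(ns.prefix, ns.href)
        return self.stored_ns[key]

    def detach(self, node):
        """Repoint node.ns to a document-owned entry before unlinking."""
        if node.ns is not None:
            node.ns = self.stable_ns(node.ns)
        return node


def free_element(elem):
    """Model freeing an element together with its declarations."""
    for ns in elem.ns_def:
        ns.freed = True
    elem.ns_def = []


doc = Document()
foo = Ns("foo", "urn:test:foo")
elem = Node("e", ns=foo)
elem.ns_def.append(foo)
attr_a = Node("a", ns=foo)
attr_b = Node("b", ns=foo)

doc.detach(attr_a)
doc.detach(attr_b)
free_element(elem)

print(attr_a.ns.freed, attr_a.ns.href)   # still valid after the free
print(attr_a.ns is attr_b.ns)            # one shared entry, no duplicates
```

Keying the store by (prefix, href) is one possible design choice; it keeps
the number of stored entries small, in contrast to re-creating a
declaration on every detached node.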

Namespace reconciliation:
When working with Libxml2 in C, adding, cutting and pasting nodes, one
could end up with a tree where some of the node->ns fields point to
elem->nsDef entries located in the wrong position (think of shadowing a
namespace prefix). Serializing such a document would produce an XML
document that is not namespace-well-formed. For this reason there's a
namespace reconciliation function in Libxml2; it adds namespace
declaration attributes where needed, so that the document is
ns-well-formed again. This partly corresponds to the W3C
namespace normalization method.
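
The idea of reconciliation can be sketched as follows. This is a
simplified toy in Python, not Libxml2's actual algorithm: it handles
elements only, ignores attributes, and does not cover the case where the
element itself already declares the same prefix for a different URI.
The names Ns, Element and reconcile() are hypothetical.

```python
class Ns:
    def __init__(self, prefix, href):
        self.prefix = prefix
        self.href = href


class Element:
    def __init__(self, name, ns=None):
        self.name = name
        self.ns = ns          # like node->ns
        self.ns_def = []      # like elem->nsDef
        self.children = []


def reconcile(elem, in_scope=None):
    """Add a declaration wherever elem.ns is not correctly in scope."""
    scope = dict(in_scope or {})
    for decl in elem.ns_def:
        scope[decl.prefix] = decl.href
    if elem.ns is not None and scope.get(elem.ns.prefix) != elem.ns.href:
        # The prefix is undeclared here or shadowed by another URI:
        # declare it on the element itself and point elem.ns at it.
        decl = Ns(elem.ns.prefix, elem.ns.href)
        elem.ns_def.append(decl)
        elem.ns = decl
        scope[decl.prefix] = decl.href
    for child in elem.children:
        reconcile(child, scope)


# Cut from <a xmlns:foo="urn:one"><foo:x/></a> ...
one = Ns("foo", "urn:one")
x = Element("x", ns=one)

# ... and pasted into <b xmlns:foo="urn:two"/>, where "foo" is shadowed.
b = Element("b")
b.ns_def.append(Ns("foo", "urn:two"))
b.children.append(x)

reconcile(b)
print([(d.prefix, d.href) for d in x.ns_def])
```

Without the reconcile() call, a naive serialization would write
<foo:x/> below xmlns:foo="urn:two" and thereby silently move x into
the wrong namespace.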

The function we are trying to create here should support both: a way
to safely remove nodes from the tree and a way to reconcile namespaces.

The way of our wrapper implementation:
We use the elem->nsDef entries only for serialization purposes.
So when working with DOM, node->ns references internally stored
entries, not elem->nsDef entries. Thus we separate namespace
declaration attributes from the Node.namespaceURI() and Node.prefix()
values, which is the W3C way. With DOM you can remove or add
an ns-declaration attribute wherever you want; it does not change
any node's namespace.
The ns-declaration attributes are only there to give the user the
ability to explicitly define locations where the XML processor should
create ns-declarations when serializing. This is important for QNames,
which can appear in the text content of a node. Example: in XML Schema's
<xs:element ref="foo:someName"/>, "foo:someName" is a QName which
needs a namespace to be declared beforehand, with the _same_ prefix.
As with Libxml2's way, we need to normalize the namespaces before
serializing. This could be optimized to normalize only those branches
where changes have been made through the API.

Now a part that may be surprising:
_Neither_ Libxml2's reconciliation function _nor_ W3C's
namespace normalization can avoid breaking a QName in some
special cases, because they may change the ns-prefix.
Libxml2's current behaviour is the more lax one here, since
it might break QNames in element content and in attribute values,
while W3C might break them in attribute values only.
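
Why a changed prefix breaks a QName can be shown with a few lines of
Python. This is only an illustration of the effect, not of either
algorithm; resolve_qname() and rename_prefix() are hypothetical helpers,
and the replacement prefix "ns1" is an arbitrary choice.

```python
def resolve_qname(qname, in_scope):
    """Resolve 'prefix:local' against a {prefix: uri} mapping."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        # Unprefixed name: use the default namespace, if one is declared.
        return (in_scope.get(None), qname)
    if prefix not in in_scope:
        return None           # the prefix is not declared: QName is broken
    return (in_scope[prefix], local)


def rename_prefix(in_scope, old, new):
    """Model a processor that rebinds a namespace to a new prefix.

    Only the declarations change; text content and attribute values
    that happen to contain QNames are not rewritten.
    """
    renamed = dict(in_scope)
    renamed[new] = renamed.pop(old)
    return renamed


# Value of the ref attribute in <xs:element ref="foo:someName"/>
ref_value = "foo:someName"

before = {"xs": "http://www.w3.org/2001/XMLSchema", "foo": "urn:test:foo"}
after = rename_prefix(before, "foo", "ns1")

print(resolve_qname(ref_value, before))
print(resolve_qname(ref_value, after))
```

The namespace itself is still declared after the change, but the string
"foo:someName" no longer resolves, because the processor has no way of
knowing that this attribute value is a QName.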
I do not encourage people to use our company's way of handling
namespaces, since it might cause problems in the future: it's not
Libxml2's way and thus not supported, and so may turn out to be 100%
against some internal expectation in the future. I hope to die before
this time comes ;-)


