Re: [xml] redicting parts of trees

From: Kasimier Buchcik <kbuchcik 4commerce de>
To: Martijn Faassen <faassen infrae com>
Cc: "xml gnome org" <xml gnome org>
Subject: Re: [xml] redicting parts of trees
Date: Thu, 19 May 2005 21:00:03 +0200

Hi,

On Thu, 2005-05-19 at 20:19 +0200, Martijn Faassen wrote:

Kasimier Buchcik wrote:

Hi,

On Thu, 2005-05-19 at 17:16 +0200, Martijn Faassen wrote:

Kasimier Buchcik wrote:


[...]

Anyway, anything I can do now to help? I will of course be testing this 
facility at some stage within lxml, and give feedback then if necessary.



You could describe how you intend to manage namespaces in your
wrapper. Will you go the W3C way or the Libxml2 namespace way?


I'm following the ElementTree way, which uses Clark notation. I.e. the 
wrapper shows namespace URIs directly as part of element names and such, 
like this:

{http://namespaces.somewhere.org/ns1}foo

and prefixes are, for now, completely ignored as not relevant to the XML 
infoset.
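
A minimal sketch of what this looks like in practice, using the
ElementTree module from the Python standard library rather than lxml
itself. The two sample documents and the split_clark helper are made up
for illustration; only the {uri}local tag format comes from the
description above.

```python
import xml.etree.ElementTree as ET

NS1 = "http://namespaces.somewhere.org/ns1"

# Two documents that bind the same namespace URI to different prefixes.
doc_a = ET.fromstring('<a:foo xmlns:a="%s"/>' % NS1)
doc_b = ET.fromstring('<b:foo xmlns:b="%s"/>' % NS1)

# The tag carries the URI in braces; the prefix is not part of the name.
clark_name = "{%s}foo" % NS1
same_name = doc_a.tag == doc_b.tag == clark_name


def split_clark(tag):
    """Split '{uri}local' into (uri, local); uri is None if unqualified."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag
```

Both elements compare equal by tag even though their prefixes differ,
which is why the prefix can be hidden from the API user.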

Ah.

Both have pros and cons. The relevant drawback of the Libxml2 way
is that it's hard, if not impossible, to implement a DOM wrapper
in a programming language where the time of an object's destruction
is not under the programmer's control.


Thanks, this is interesting as this is exactly what I'm trying to do 
with lxml.


Yeah, I read some of the messages on your lxml list about your mechanism
to keep detached nodes alive if they are referenced by multiple wrapper
proxies. We took a sometimes memory-consuming but simple approach: we
never free any removed Libxml2 nodes from the document; they are moved
into an internal list of "garbage" nodes in the document wrapper and
freed when the document is freed. A "flush" method can be used to
clean up such "garbage" nodes, if the user is sure that it's safe.

An example (in Delphi code):
(all vars are interfaces here, not objects)
var
  doc: IDOMDocument;
  elem: IDOMElement;
  node: IDOMNode;
begin
  elem := doc.documentElement;

  // Remove the document element and put it on the garbage list.
  node := doc.removeChild(elem);
  { The wrapper referenced by @node may be released by Delphi at any
    time, but the underlying Libxml2 node stays alive. }

  // This would free elem's Libxml2 node.
  // doc.flushGarbage;

  // Attach to the tree again and remove from the garbage list.
  doc.appendChild(elem);
end;
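
The same idea as a rough Python sketch. This is a pure-Python model, not
the actual wrapper: Node, Document, remove_child, append_child and
flush_garbage are hypothetical names, and the "freed" flag only stands
in for the point at which the underlying Libxml2 node would be released.

```python
class Node:
    """Stand-in for a Libxml2 node; 'freed' models the C node being released."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []
        self.freed = False


class Document:
    """Document wrapper that parks removed nodes instead of freeing them."""

    def __init__(self, root):
        self.root = root
        self.garbage = []  # detached nodes, kept alive until flushed

    def remove_child(self, parent, child):
        parent.children.remove(child)
        child.parent = None
        self.garbage.append(child)  # detached, but not freed
        return child

    def append_child(self, parent, child):
        if child.freed:
            raise ValueError("node was already flushed")
        if child in self.garbage:
            self.garbage.remove(child)  # back in the tree, no longer garbage
        parent.children.append(child)
        child.parent = parent

    def flush_garbage(self):
        # Only safe if the caller knows nothing references these nodes.
        for node in self.garbage:
            node.freed = True
        self.garbage.clear()
```

Re-attaching a removed node before a flush is always safe in this model;
after a flush the node is gone, which is why the flush is left to the
user.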

[...]

This circumstance creates the following problem:
If you remove an attribute node which is bound to a namespace
from its parent, the attr->ns field still points to an elem->nsDef
entry. This is OK as long as this element node is not itself
freed, which would free the elem->nsDef entries as well. The
destruction of this element would leave attr->ns pointing to freed
memory.


Ugh.  Luckily the ElementTree API doesn't allow the detaching of 
attribute nodes from an element, but I can see how this would hurt any 
W3C DOM implementation.


For ElementTree, the Libxml2 way seems to be safe enough. Good!

But now I wonder: does this only apply to attribute nodes, or also to 
element nodes which are in a subtree? Testing this.. Ugh, yes, it does. 
When I move a namespaced element (where the namespace is defined higher 
in the tree) into another tree, and then subsequently remove the 
original tree, things go way wrong and valgrind indeed points to a 
reference to a libxml2 namespace structure that has since been removed. 
Not good...

But thanks for pointing this issue out to me!

There's no automatic mechanism to avoid this, since there is
no reference counting involved. In C this should be user
controllable: you just have to know what you are freeing and when.
That is not the case in other programming languages like Python,
Delphi, Java, etc., where the destruction time of objects is not
always, if ever, predictable.


Indeed. Python tends to be fairly predictable if its refcounting 
algorithm is used, but that doesn't help any here, and that isn't 
constant across Python implementations anyway.

Safe removal of nodes:
So we obviously need a mechanism that makes the node->ns reference
point to an xmlNs entry which is not in danger of being freed
unpredictably. A possible location would be a list of xmlNs entries,
internally managed by the DOM document wrapper.
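
A rough Python model of that idea, assuming the wrapper re-points the
reference at the moment the attribute is detached. All names here (Ns,
Element, Attr, DocumentWrapper, stored_ns, detach_attr) are
hypothetical, and the "freed" flag merely models an xmlNs entry whose
memory has been released along with its owning element.

```python
class Ns:
    """Stand-in for an xmlNs entry."""

    def __init__(self, prefix, href):
        self.prefix = prefix
        self.href = href
        self.freed = False


class Attr:
    def __init__(self, name, ns=None):
        self.name = name
        self.ns = ns  # models attr->ns


class Element:
    def __init__(self, name):
        self.name = name
        self.ns_def = []  # models elem->nsDef, owned by this element
        self.attrs = []

    def free(self):
        # Freeing the element frees its namespace declarations too.
        for ns in self.ns_def:
            ns.freed = True


class DocumentWrapper:
    def __init__(self):
        self.stored_ns = []  # entries owned by the wrapper, not by any element

    def _intern(self, ns):
        """Return a wrapper-owned entry with the same prefix and URI."""
        for stored in self.stored_ns:
            if stored.prefix == ns.prefix and stored.href == ns.href:
                return stored
        stored = Ns(ns.prefix, ns.href)
        self.stored_ns.append(stored)
        return stored

    def detach_attr(self, elem, attr):
        elem.attrs.remove(attr)
        if attr.ns is not None:
            # Re-point to an entry that outlives the element.
            attr.ns = self._intern(attr.ns)
        return attr
```

In this model the detached attribute keeps a valid namespace reference
even after its former parent element is freed, because the entry it
points to lives as long as the document wrapper does.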


Yes, in this case the problem would devolve to the issue I already have
with dictionaries, which is manageable as I can make this stuff globally 
shared. Though, just as with dictionaries I hope that the adoptNode() 
functionality could take care of this as well.

I suspect that adoptNode() recreating namespaces wherever necessary in 
the new document would indeed be sufficient to support Clark notation 
in ElementTree, even though the XML serialization would look ugly. Am I 
correct that an adoptNode() would take care of this issue if prefixes 
are hidden from the API user's view?


Yes, in your case, if single attributes are not expected to be adopted,
and potentially many auto-created namespace declarations don't bother
you, the mechanism of xmlReconciliateNs seems to be the best fit: it
just re-creates the missing declarations on the adopted element. OK,
good to know that!
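
To illustrate the "re-create the missing declarations on the adopted
element" step, here is a simplified pure-Python model. It is not the
algorithm xmlReconciliateNs actually implements (it ignores prefix
clashes and declarations already available in the new tree, for
instance); Ns, Element, reconcile and adopt are hypothetical names.

```python
class Ns:
    def __init__(self, prefix, href):
        self.prefix = prefix
        self.href = href


class Element:
    def __init__(self, name, ns=None):
        self.name = name
        self.ns = ns        # models node->ns
        self.ns_def = []    # models node->nsDef
        self.parent = None
        self.children = []

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child


def iter_subtree(elem):
    yield elem
    for child in elem.children:
        yield from iter_subtree(child)


def declared_within(node, ns, top):
    """True if ns is declared on node or an ancestor, up to and including top."""
    while node is not None:
        if any(d is ns for d in node.ns_def):
            return True
        if node is top:
            break
        node = node.parent
    return False


def reconcile(top):
    """Re-create declarations that live outside the subtree on its top element."""
    recreated = {}
    for node in iter_subtree(top):
        ns = node.ns
        if ns is None or declared_within(node, ns, top):
            continue
        key = (ns.prefix, ns.href)
        if key not in recreated:
            recreated[key] = Ns(ns.prefix, ns.href)
            top.ns_def.append(recreated[key])
        node.ns = recreated[key]


def adopt(elem, new_parent):
    if elem.parent is not None:
        elem.parent.children.remove(elem)
    new_parent.append(elem)
    reconcile(elem)
```

After adopt(), no element in the moved subtree refers to a declaration
owned by the old tree, so the old tree can be freed without leaving
dangling references. The cost is the extra declarations on the adopted
element, which is what makes the serialization less pretty.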

Regards,

Kasimier
