Re: [xml] redicting parts of trees

On Sun, May 15, 2005 at 12:37:38AM +0200, Martijn Faassen wrote:
Would the developers be open to me suggesting changes to the XSLT 
codebase to make this possible again? I suppose I should ask on the

Yes that should be doable. I'm not sure what would be the best API
for this.

I'm not either, but I'll think about this. Would sharing a dictionary
break the read-only guarantee though, and thus break multi-threading?

  I'm afraid yes. Basically, if nodes are generated then the dictionary
associated with the transformation will grow. If multiple transformations
run in parallel using the same dictionary then you have concurrent
unsynchronized accesses to the dictionary. At least that was the line of
thought when I created that subdict thing. I note that in the meantime
we added a mutex to the dictionaries, so this should no longer be
a problem. So in a nutshell I think it will break the read-only assumption
but it won't break multithreading, so this should be doable.
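
To illustrate the point about the mutex, here is a minimal sketch in Python, not libxml2 code. The SharedDict class and the transform() function are made up for the example. It only shows that a shared string table guarded by a lock keeps growing while several "transformations" run (so it is not read-only), yet stays consistent.

```python
import threading


class SharedDict:
    """Toy string table shared by several threads (hypothetical, not libxml2)."""

    def __init__(self):
        self._strings = {}
        self._lock = threading.Lock()

    def lookup(self, s):
        # Return the stored entry for s, adding it first if it is missing.
        # The lock serializes the check-then-insert step.
        with self._lock:
            return self._strings.setdefault(s, s)

    def __len__(self):
        with self._lock:
            return len(self._strings)


def transform(shared, prefix, count):
    # Stand-in for a transformation that generates new node names.
    for i in range(count):
        shared.lookup(f"{prefix}-{i}")   # name unique to this thread
        shared.lookup("common")          # name used by every thread


shared = SharedDict()
threads = [
    threading.Thread(target=transform, args=(shared, f"t{n}", 100))
    for n in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The table grew during the runs: 4 * 100 unique names plus "common".
print(len(shared))
```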

list, so let's move on to the real purpose of this mail.

Exploring these issues made me conclude that it's time to at least
look at the alternative to sharing a single global dictionary,
redicting parts of trees. A redicting operation would take place
whenever a node is moved into a new tree. All the strings in the
subtree below this node will be traced to the originating
document's dictionary, and the entries will be copied into the
target document's dictionary. Additionally, all string references
in the subtree will be made to point to the new document's

Yes, it seems that at the DOM level this operation is called an
import based on some PHP/javascript examples I saw recently.

Yes, the W3C DOM indeed defines an importNode operation, and I guess I'm
asking for the equivalent here. :)
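
A rough sketch of what such a redicting pass could look like, written in Python against a toy tree. Doc, Node and redict() are hypothetical names for this example only, and the real thing would deal with C string pointers rather than Python objects.

```python
class Doc:
    """Toy document owning a string table (hypothetical)."""

    def __init__(self):
        self.strings = {}

    def intern(self, s):
        # Record s in this document's table and return the stored entry.
        # In C this would copy the bytes into the dictionary.
        return self.strings.setdefault(s, s)


class Node:
    def __init__(self, doc, name, children=()):
        self.doc = doc
        self.name = doc.intern(name)
        self.children = list(children)


def redict(node, target):
    """Re-register every name in the subtree with the target document."""
    stack = [node]
    while stack:
        cur = stack.pop()
        cur.name = target.intern(cur.name)  # entry now lives in target's table
        cur.doc = target                    # node now belongs to target
        stack.extend(cur.children)


src = Doc()
dst = Doc()
tree = Node(src, "chapter", [Node(src, "title"), Node(src, "para"), Node(src, "para")])

redict(tree, dst)
src.strings.clear()   # the source table can go away now

print(sorted(dst.strings))
```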

I think that if we add this then we should try to match the existing
semantics of those operations, in PHP for example.

Does PHP implement this operation on top of libxml2? We might also want

  yes. PHP5 is on top of libxml2. Hard to tell how reusable the code would be
as an example without looking at it. They may have some intermediate layer.

to consider the W3C importNode semantics, though I doubt it actually says
much of use for us here...

  Well just trying to follow the principle of least surprise.

The things which need to be checked when preparing for such an import
are:
- doc remapping

By this you mean telling all nodes about the new document node, right?


- dictionary remapping
- namespace references to the original

As a document contains a list of all the namespace references, right? So

  yes but not centralized. You have to walk the subtree and the ancestors
to build a full picture.

if the original document were to be destroyed, namespace references to
it from nodes now in new documents would be pointed to free space.

  to freed data, yes.

- namespace remapping to the local document

What does this mean as compared to the previous, namespace references to
the original document?

  instead of recreating all the namespace declarations used in the subtree
at the insertion point, reuse the declarations already in scope at that
insertion point. For example, if {"dbk", ""} is in scope at the
insertion point due to an ancestor holding the declaration, and it
is among the namespace bindings in use by the subtree, do not redeclare
it at the insertion point. Apparently one difference between the PHP5 and
javascript implementations of import is that the javascript one would reuse
the namespace declaration even if it is defined with a different prefix (hence
changing the prefixes in the pruned subtree). I don't know which semantics is
DOM's :-)
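
The two behaviours described above can be sketched as a small Python function. resolve_ns() and its reuse_other_prefix flag are invented for this example, and "urn:example:docbook" is a placeholder URI, not the one from the original message.

```python
def resolve_ns(in_scope, prefix, uri, reuse_other_prefix=False):
    """Decide how a (prefix, uri) binding used by the subtree is handled.

    in_scope maps prefix -> uri for declarations visible at the insertion
    point. Returns (prefix_to_use, must_declare).
    """
    # Same prefix bound to the same URI: nothing to redeclare.
    if in_scope.get(prefix) == uri:
        return prefix, False
    # Optional behaviour: reuse a declaration of the same URI under another
    # prefix, which changes the prefix used inside the moved subtree.
    if reuse_other_prefix:
        for other_prefix, other_uri in in_scope.items():
            if other_uri == uri:
                return other_prefix, False
    # Otherwise the binding has to be declared at the insertion point.
    return prefix, True


DBK = "urn:example:docbook"   # placeholder namespace name
scope = {"dbk": DBK}

print(resolve_ns(scope, "dbk", DBK))
print(resolve_ns(scope, "d", DBK))
print(resolve_ns(scope, "d", DBK, reuse_other_prefix=True))
```

With reuse_other_prefix left off, a subtree using the prefix "d" keeps it and gets a fresh declaration. With it on, the subtree is switched to the "dbk" prefix already in scope.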

- entity references to the original document
I think those are the only pointers which are added to the pure
tree-oriented parent/child/sibling ones.

Thanks for the list!

  no problem.

Looking at the import implementation of PHP5 might give us an idea of
how to implement this. Note that there are incomplete APIs dealing
either just with document pointers (xmlSetTreeDoc) or just namespaces
(xmlNewReconciliedNs and xmlReconciliateNs).

Okay, I shall study the implementations of those. It would probably be
more efficient to provide a function that did all the remapping in one
operation as it traversed the tree, though.

  yes that's what I have in mind. It should be doable in a single pass:
first walk all the ancestors of the insertion node to collect the existing
namespaces, then scan the subtree being moved in document order.
I'm not sure it's fun though, xmlReconciliateNs() does some of this.
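
The single pass described above can be sketched as follows, in Python on a toy tree rather than on libxml2 structures. Node, in_scope_namespaces() and adopt() are hypothetical, and so are the namespace URIs. As a simplification, a missing declaration is added on the node that needs it rather than on the subtree root, and dictionary or entity remapping (which would go in the same loop) is left out.

```python
class Node:
    def __init__(self, name, ns=None, ns_decls=None):
        self.name = name
        self.ns = ns                          # (prefix, uri) used by the node, or None
        self.ns_decls = dict(ns_decls or {})  # prefix -> uri declared on the node
        self.parent = None
        self.children = []
        self.doc = None

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child


def in_scope_namespaces(node):
    # Walk from node up to the root; the nearest declaration of a prefix wins.
    scope = {}
    while node is not None:
        for prefix, uri in node.ns_decls.items():
            scope.setdefault(prefix, uri)
        node = node.parent
    return scope


def adopt(subtree, new_parent, doc):
    """Move subtree under new_parent, fixing doc pointers and namespaces."""
    stack = [(subtree, in_scope_namespaces(new_parent))]  # step 1: ancestors
    while stack:                                          # step 2: document order
        node, scope = stack.pop()
        node.doc = doc
        scope = {**scope, **node.ns_decls}
        if node.ns is not None and scope.get(node.ns[0]) != node.ns[1]:
            prefix, uri = node.ns
            node.ns_decls[prefix] = uri       # not in scope: declare it here
            scope[prefix] = uri
        for child in reversed(node.children):
            stack.append((child, scope))
    if subtree.parent is not None:
        subtree.parent.children.remove(subtree)
    new_parent.append(subtree)


U, X = "urn:example:docbook", "urn:example:x"   # placeholder URIs
src_root = Node("src", ns_decls={"dbk": U, "x": X})
sect = src_root.append(Node("sect", ns=("dbk", U)))
para = sect.append(Node("para", ns=("dbk", U)))
note = sect.append(Node("note", ns=("x", X)))

target_doc = object()
book = Node("book", ns_decls={"dbk": U})
chapter = book.append(Node("chapter"))

adopt(sect, chapter, target_doc)
print(sect.ns_decls, note.ns_decls)
```

Here "dbk" is already in scope under book, so sect and para need no new declaration, while "x" is not and ends up declared on note.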


In order to write a good redicting operation, I'd need a bit more 
information about which information in a tree exactly can end up in
a dictionary. If someone would be able to give me a list of what
ends up in the dictionary, that would be extremely helpful.

All markup names, all namespace strings (prefixes and namespace names)
and some text node content (so that all "formatting nodes" used to
indent share as much as possible, or very short text nodes, for
example "0" or "1").

Do short text nodes include attribute values?

  yes, an attribute has a children list, which is usually a single text node
but sometimes also includes entity references and text nodes intermixed.

P.S.: I think I should be able to design a method to make importing
strings from a given dictionary into python strings much faster when
repeatedly querying the same set of strings. The principle would be
to add an API to the dictionary returning an index for the string
(cost O(1)) and at the python binding level have an array keeping
pointers to the strings already converted (Py_INCREF'ed of course).

That would be very nice to have! I played with this idea before myself, 
but didn't get anything working yet. I will think about this some more. 
Is there any userdata facility in dictionaries already?

  no there isn't. They are an opaque structure too. Adding a _private would
require accessor functions to be added. That's doable.
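
The index-plus-array idea from the P.S. can be sketched in pure Python. IndexedDict stands in for the C-level dictionary and StringCache for the binding-level array; both names, and the idea of handing out indices from add(), are assumptions for this example, since no such API exists yet.

```python
class IndexedDict:
    """Toy stand-in for a C dictionary that hands out an index per string."""

    def __init__(self):
        self._index = {}     # raw bytes -> index
        self._entries = []   # index -> raw bytes

    def add(self, raw):
        idx = self._index.get(raw)
        if idx is None:
            idx = len(self._entries)
            self._index[raw] = idx
            self._entries.append(raw)
        return idx

    def get(self, idx):
        return self._entries[idx]


class StringCache:
    """Binding-level array of already converted Python strings."""

    def __init__(self, dictionary):
        self._dict = dictionary
        self._converted = []   # index -> str, or None if not converted yet
        self.conversions = 0   # counts real conversions, for illustration

    def to_str(self, idx):
        # Grow the array on demand so idx is a valid slot.
        if idx >= len(self._converted):
            self._converted.extend([None] * (idx + 1 - len(self._converted)))
        cached = self._converted[idx]
        if cached is None:
            cached = self._dict.get(idx).decode("utf-8")
            self._converted[idx] = cached
            self.conversions += 1
        return cached


d = IndexedDict()
i_para = d.add(b"para")
i_title = d.add(b"title")

cache = StringCache(d)
names = [cache.to_str(i) for i in (i_para, i_title, i_para, i_para)]
print(names, cache.conversions)
```

Repeated queries for the same index return the same Python object, so the decoding cost is paid once per distinct string.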


Daniel Veillard      | Red Hat Desktop team
veillard redhat com  | libxml GNOME XML XSLT toolkit | Rpmfind RPM search engine
