Re: [xml] redicting parts of trees



On Fri, May 13, 2005 at 11:15:08PM +0200, Martijn Faassen wrote:
Hi there,

libxml2 documents use dictionaries to hash commonly used strings in XML 
documents, for reasons of space and time efficiency.

Unfortunately this makes it impossible to safely move a node from one 
tree to another.

In the process of writing lxml, an alternative Python wrapper for 
libxml2 that among other things aims at automatic memory management, 
I've so far worked around this issue by sharing dictionaries between all 
trees, whether they be created through reading in XMl, or created from 
scratch.

This works, though with the disadvantage of an endlessly growing 
dictionary as more and more trees are read in, and potential threading 
issues.

I recently discovered another disadvantage: if I understand it 
correctly, XSLT processing, in xsltNewTransformContext, creates a new 
dictionary in its context, with the style's signature as the subdictionary.

  Up to there I agree, but I don't understand "style's signature".
If you meant the "style's dictionnary", then yes.

The style's signature is based on the style document's signature.

The style document's signature in lxml is of course the single shared 
dictionary, but unfortunately it seems to be unavoidable (except by 
rewriting xsltNewTransformContext itself) to use the shared dictionary 
for the result of XSLT processing. This means that nodes from the tree 
resulting from the XSLT processing cannot be safely moved to other trees 
which do use the single shared dictionary.

  The reason is that libxslt makes the garantee that a compiled stylesheet
is read-only when used to make a transformation. It also avoid problems
of shared resources in multi-threaded XSLT engines.

Would the developers be open to me suggesting changes to the XSLT 
codebase to make this possible again? I suppose I should ask on the XSLT 

  Yes that should be doable. I'm not sure what would be the best API for this.

list, so let's move on to the real purpose of this mail.

Exploring these issues made me conclude that it's time to at least look 
at the alternative to sharing a single global dictionary, redicting 
parts of trees. A redicting operation would take place whenever a node 
is moved into a new tree. All the strings in the subtree below this node 
will be traced to the originating document's dictionary, and the entries 
will be copied into the target document's dictionary. Additionally, all 
string references in the subtree will be made to point to the new 
document's dictionary.

  Yes, it seems that at the DOM level this operation is called an import
based on some PHP/javascript examples I saw recently. I think that if we
add this then we should try to match the existing semantic of those
operation in PHP for example. The thing which need to be checked
when preparing for such an import are:
    - doc remapping
    - dictionnary remapping
    - namespace references to the original document
    - namespace remapping to the local document
    - entities reference to the original document
 I think those are the only pointers which are added to the pure tree
oriented parent/child/sibling ones.
  Looking at the import implementation of PHP5 might give us an idea
of how to implement this. Note that there are incomplete APIs
dealing either just with document pointers (xmlSetTreeDoc) or just
namespaces (xmlNewReconciliedNs and xmlReconciliateNs).

Redicting is a potentially expensive operation, but I think that for 
lxml's purposes, it's may be worth it to incur this cost. I also have 
hopes that there are still ways to optimize this, so that redicting 
doesn't have to happen in all cases -- after all, dictionary sharing 
works and pools of documents might end up sharing a single dictionary.

  yes, agreed. A lot of heuristics can be used to avoid some of the costs
if used in a systematic fashion.

In order to write a good redicting operation, I'd need a bit more 
information about which information in a tree exactly can end up in a 
dictionary. If someone would be able to give me a list of what ends up 
in the dictionary, that would be extremely helpful.

  All markup names, all namespaces strings (prefix and namespace names)
and some text node content (so that all "formatting nodes" used to indent
share as much as possible, or very short text nodes for example "0" or "1").

  Hope it helps, and thanks !

Daniel

P.S.: I think I should be able to design a method to make importing strings
      from a given dictionnary into python strings quite faster for repeatedly
      querying the same set of strings. The principle would be to add an
      API to the dictionnary returning an index for the string (cost O(1))
      and at the python binding level have an array keeping pointers to
      the strings already converted (Py_INCREF'ed of course).

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]