[xml] redicting parts of trees
- From: Martijn Faassen <faassen infrae com>
- To: xml gnome org
- Subject: [xml] redicting parts of trees
- Date: Fri, 13 May 2005 23:15:08 +0200
Hi there,
libxml2 documents use dictionaries to hash commonly used strings in XML
documents, for reasons of space and time efficiency.
Unfortunately this makes it impossible to safely move a node from one
tree to another.
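
A toy Python model can make the hazard concrete. This is not libxml2 code: the `Pool` and `Node` classes are invented for illustration, and the sketch assumes that a document's pooled strings are released together with its dictionary.

```python
class Pool:
    """Toy stand-in for a per-document string dictionary."""

    def __init__(self):
        self.strings = {}
        self.freed = False

    def intern(self, text):
        # Keep one entry per distinct string and hand out a (pool, text) reference.
        self.strings.setdefault(text, text)
        return (self, text)

    def free(self):
        # Freeing the document releases every pooled string at once.
        self.strings.clear()
        self.freed = True


class Node:
    def __init__(self, pool, name):
        self.name_ref = pool.intern(name)

    def name(self):
        pool, text = self.name_ref
        if pool.freed:
            raise RuntimeError("name points into a freed dictionary")
        return text


doc_a, doc_b = Pool(), Pool()
node = Node(doc_a, "para")  # the node's name lives in doc_a's pool

# Pretend the node has been relinked under doc_b without touching its
# string reference, and that the original document is then freed.
doc_a.free()
```

Calling `node.name()` at this point raises `RuntimeError` in the model. In C the equivalent situation would presumably be a dangling pointer rather than a clean error.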
In the process of writing lxml, an alternative Python wrapper for
libxml2 that among other things aims at automatic memory management,
I've so far worked around this issue by sharing dictionaries between all
trees, whether they be created through reading in XML, or created from
scratch.
This works, though with the disadvantage of an endlessly growing
dictionary as more and more trees are read in, and potential threading
issues.
I recently discovered another disadvantage: if I understand it
correctly, XSLT processing, in xsltNewTransformContext, creates a new
dictionary in its context, with the style's dictionary as the subdictionary.
The style's dictionary is based on the style document's dictionary.
The style document's dictionary in lxml is of course the single shared
dictionary, but unfortunately it seems to be impossible (except by
rewriting xsltNewTransformContext itself) to use the shared dictionary
for the result of XSLT processing. This means that nodes from the tree
resulting from the XSLT processing cannot be safely moved to other trees
which do use the single shared dictionary.
Would the developers be open to me suggesting changes to the XSLT
codebase to make this possible again? I suppose I should ask on the XSLT
list, so let's move on to the real purpose of this mail.
Exploring these issues made me conclude that it's time to at least look
at the alternative to sharing a single global dictionary, redicting
parts of trees. A redicting operation would take place whenever a node
is moved into a new tree. All the strings in the subtree below this node
will be traced to the originating document's dictionary, and the entries
will be copied into the target document's dictionary. Additionally, all
string references in the subtree will be made to point to the new
document's dictionary.
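
The operation described above might look roughly like the following Python sketch. It is a toy model, not libxml2 code: `Entry`, `StringDict`, `Node` and `redict` are hypothetical names, and only element names are tracked here. Which other strings would need the same treatment in a real libxml2 tree is left open.

```python
class Entry:
    """One pooled string; identity (not equality) says which pool owns it."""

    def __init__(self, text):
        self.text = text


class StringDict:
    """Toy stand-in for a document dictionary."""

    def __init__(self):
        self._pool = {}

    def intern(self, text):
        # Return the pooled entry for `text`, creating it on first use.
        entry = self._pool.get(text)
        if entry is None:
            entry = self._pool[text] = Entry(text)
        return entry

    def owns(self, entry):
        # True only if `entry` is the very object stored in this pool.
        return self._pool.get(entry.text) is entry


class Node:
    def __init__(self, doc_dict, name, children=()):
        self.name = doc_dict.intern(name)
        self.children = list(children)


def redict(node, target):
    """Re-point every pooled name in the subtree at `target`."""
    stack = [node]
    while stack:
        current = stack.pop()
        if not target.owns(current.name):
            # Copy the string into the target pool and swap the reference.
            current.name = target.intern(current.name.text)
        stack.extend(current.children)


src, dst = StringDict(), StringDict()
subtree = Node(src, "section", [Node(src, "para"), Node(src, "para")])
redict(subtree, dst)  # every name in the subtree now belongs to dst
```

The `owns` check means that strings already living in the target dictionary are left alone, so redicting a subtree that never left its document changes nothing.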
Redicting is a potentially expensive operation, but I think that for
lxml's purposes, it may be worth it to incur this cost. I also have
hopes that there are still ways to optimize this, so that redicting
doesn't have to happen in all cases -- after all, dictionary sharing
works and pools of documents might end up sharing a single dictionary.
In order to write a good redicting operation, I'd need to know exactly
which strings in a tree can end up in a dictionary. If someone would be
able to give me a list of what ends up in the dictionary, that would be
extremely helpful.
Thank you,
Martijn