[xml] redicting parts of trees

From: Martijn Faassen <faassen infrae com>
To: xml gnome org
Subject: [xml] redicting parts of trees
Date: Fri, 13 May 2005 23:15:08 +0200

Hi there,

libxml2 documents use dictionaries to hash commonly used strings in XML documents, for reasons of space and time efficiency.

Unfortunately this makes it impossible to safely move a node from one tree to another.
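To make the problem concrete, here is a toy model in Python. It is not the libxml2 API: StringDict, Entry, Node and strings_freed_individually are hypothetical names, and the model assumes that freeing a tree releases one by one every string the document's own dictionary does not own.

```python
class Entry:
    """One interned string; object identity stands in for a C pointer."""
    def __init__(self, text):
        self.text = text


class StringDict:
    """Toy per-document string dictionary (hypothetical, not libxml2)."""
    def __init__(self):
        self._entries = {}

    def intern(self, text):
        # Return the single shared entry for this text, creating it once.
        if text not in self._entries:
            self._entries[text] = Entry(text)
        return self._entries[text]

    def owns(self, entry):
        return self._entries.get(entry.text) is entry


class Node:
    def __init__(self, name):
        self.name = name
        self.children = []


def strings_freed_individually(root, doc_dict):
    """Names a free routine would release itself, because the
    document's dictionary does not own them (assumed behaviour)."""
    freed, stack = [], [root]
    while stack:
        node = stack.pop()
        if not doc_dict.owns(node.name):
            freed.append(node.name.text)
        stack.extend(node.children)
    return freed


source, target = StringDict(), StringDict()
item = Node(source.intern("item"))
root = Node(target.intern("root"))
root.children.append(item)  # naive move: the name still lives in `source`
mistaken = strings_freed_individually(root, target)
```

In this model the target document would release "item" itself although the source dictionary still owns it, which in C would mean a double free or a dangling pointer.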

In the process of writing lxml, an alternative Python wrapper for libxml2 that among other things aims at automatic memory management, I've so far worked around this issue by sharing dictionaries between all trees, whether they be created through reading in XML, or created from scratch.

This works, though with the disadvantage of an endlessly growing dictionary as more and more trees are read in, and potential threading issues.

I recently discovered another disadvantage: if I understand it correctly, XSLT processing, in xsltNewTransformContext, creates a new dictionary in its context, with the style's dictionary as the subdictionary.

The style's dictionary is based on the style document's dictionary.

The style document's dictionary in lxml is of course the single shared dictionary, but unfortunately it seems to be impossible (except by rewriting xsltNewTransformContext itself) to use the shared dictionary for the result of XSLT processing. This means that nodes from the tree resulting from the XSLT processing cannot be safely moved to other trees which do use the single shared dictionary.
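A small Python sketch of why the subdictionary arrangement is one-way. This is a toy model, not libxml2 or libxslt code: StringDict and its subdict field are hypothetical, and it assumes that lookups fall back to the subdictionary while new strings are only ever stored in the new dictionary.

```python
class StringDict:
    """Toy string dictionary with an optional read-only fallback."""

    def __init__(self, subdict=None):
        self.subdict = subdict  # consulted on lookup, never written to
        self.own = set()        # strings stored in this dictionary itself

    def owns(self, text):
        if text in self.own:
            return True
        return self.subdict is not None and self.subdict.owns(text)

    def intern(self, text):
        # Reuse an existing entry (here or in the fallback), else add locally.
        if not self.owns(text):
            self.own.add(text)
        return text


shared = StringDict()                  # the single shared dictionary
shared.intern("xsl:template")

context = StringDict(subdict=shared)   # what the transform context would get
context.intern("xsl:template")         # found through the fallback
context.intern("html")                 # new string, stored in `context` only
```

The context dictionary can see everything in the shared one, but a string first created during the transformation ("html" here) is unknown to the shared dictionary, so a result node using it does not belong in a tree that relies on the shared dictionary.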

Would the developers be open to me suggesting changes to the XSLT codebase to make this possible again? I suppose I should ask on the XSLT list, so let's move on to the real purpose of this mail.

Exploring these issues made me conclude that it's time to at least look at the alternative to sharing a single global dictionary: redicting parts of trees. A redicting operation would take place whenever a node is moved into a new tree. All the strings in the subtree below this node will be traced to the originating document's dictionary, and the entries will be copied into the target document's dictionary. Additionally, all string references in the subtree will be made to point to the new document's dictionary.
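A minimal sketch of such a redicting pass, again as a toy Python model rather than libxml2 code. StringDict, Entry, Node and redict are hypothetical names, and only node names are handled, since which fields actually end up in the dictionary is exactly the open question.

```python
class Entry:
    """One interned string; object identity stands in for a C pointer."""
    def __init__(self, text):
        self.text = text


class StringDict:
    """Toy per-document string dictionary (hypothetical, not libxml2)."""
    def __init__(self):
        self._entries = {}

    def intern(self, text):
        if text not in self._entries:
            self._entries[text] = Entry(text)
        return self._entries[text]

    def owns(self, entry):
        return self._entries.get(entry.text) is entry


class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def redict(node, source, target):
    """Repoint every dictionary string in the subtree from `source`
    to `target`; returns how many references were changed."""
    if source is target:
        return 0  # documents already share a dictionary: nothing to do
    repointed, stack = 0, [node]
    while stack:
        current = stack.pop()
        if source.owns(current.name):
            # Copy the entry into the target dictionary and repoint.
            current.name = target.intern(current.name.text)
            repointed += 1
        stack.extend(current.children)
    return repointed


source, target = StringDict(), StringDict()
subtree = Node(source.intern("list"),
               [Node(source.intern("item")), Node(source.intern("item"))])
count = redict(subtree, source, target)
```

The early return when both documents already use the same dictionary is the cheap case; otherwise the cost is one walk over the moved subtree.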

Redicting is a potentially expensive operation, but I think that for lxml's purposes, it may be worth it to incur this cost. I also have hopes that there are still ways to optimize this, so that redicting doesn't have to happen in all cases -- after all, dictionary sharing works and pools of documents might end up sharing a single dictionary.

In order to write a good redicting operation, I'd need a bit more information about exactly which information in a tree can end up in a dictionary. If someone would be able to give me a list of what ends up in the dictionary, that would be extremely helpful.


Thank you,

Martijn

Follow-Ups:
- Re: [xml] redicting parts of trees
  - From: Daniel Veillard
