[xml] redicting parts of trees



Hi there,

libxml2 documents use dictionaries to hash commonly used strings in XML documents, for reasons of space and time efficiency.

Unfortunately this makes it impossible to safely move a node from one tree to another.
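To make the problem concrete, here is a minimal toy model in Python. It is only a sketch and not libxml2's actual API: `Entry`, `StringDict` and `Node` are made-up names, and the `alive` flag merely stands in for "this memory is still valid", which in C would be a dangling pointer rather than a flag.

```python
class Entry:
    """One interned string; `alive` stands in for 'memory still valid'."""
    def __init__(self, text):
        self.text = text
        self.alive = True


class StringDict:
    """Toy per-document dictionary: one shared Entry per distinct string."""
    def __init__(self):
        self._entries = {}

    def intern(self, text):
        # Return the existing entry, or create it on first use.
        if text not in self._entries:
            self._entries[text] = Entry(text)
        return self._entries[text]

    def free(self):
        # Freeing the dictionary invalidates every string it owns.
        for entry in self._entries.values():
            entry.alive = False
        self._entries.clear()


class Node:
    def __init__(self, name_entry):
        self.name = name_entry


dict_a, dict_b = StringDict(), StringDict()
node = Node(dict_a.intern("para"))  # node created in document A

# Within one dictionary, equal names share a single entry.
assert dict_a.intern("para") is node.name
# A different dictionary keeps its own, unrelated entry.
assert dict_b.intern("para") is not node.name

# "Move" the node to document B without touching its strings, then
# free document A: the node's name now refers to freed storage.
dict_a.free()
assert node.name.alive is False
```

The node itself was never modified, yet it became invalid as soon as the dictionary of its original document went away.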

In the process of writing lxml, an alternative Python wrapper for libxml2 that among other things aims at automatic memory management, I've so far worked around this issue by sharing a single dictionary between all trees, whether they are created by reading in XML or created from scratch.

This works, though with the disadvantage of an endlessly growing dictionary as more and more trees are read in, and potential threading issues.

I recently discovered another disadvantage: if I understand it correctly, XSLT processing, in xsltNewTransformContext, creates a new dictionary in its context, with the style's dictionary as the subdictionary.

The style's dictionary is based on the style document's dictionary.

The style document's dictionary in lxml is of course the single shared dictionary, but unfortunately it seems to be impossible (except by rewriting xsltNewTransformContext itself) to use the shared dictionary for the result of XSLT processing. This means that nodes from the tree resulting from the XSLT processing cannot be safely moved to other trees, which do use the single shared dictionary.
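The subdictionary arrangement described above can be sketched with another toy model in Python. This is my understanding of the behaviour, not libxslt's code: `Entry` and `StringDict` are hypothetical names, and the `subdict` parameter only mimics the idea of a dictionary that falls back to another one for lookups.

```python
class Entry:
    def __init__(self, text):
        self.text = text


class StringDict:
    """Toy dictionary that can fall back to a subdictionary for lookups."""
    def __init__(self, subdict=None):
        self.subdict = subdict
        self._entries = {}

    def owns(self, entry):
        # True only if this exact entry is stored in this dictionary.
        return self._entries.get(entry.text) is entry

    def intern(self, text):
        # Reuse a string already stored here...
        if text in self._entries:
            return self._entries[text]
        # ...or one already known to the subdictionary...
        if self.subdict is not None and text in self.subdict._entries:
            return self.subdict._entries[text]
        # ...but store any new string here, not in the subdictionary.
        entry = self._entries[text] = Entry(text)
        return entry


shared = StringDict()                       # the single shared dictionary
shared.intern("title")                      # a name the stylesheet already uses
result_dict = StringDict(subdict=shared)    # dictionary of the transform context

reused = result_dict.intern("title")
fresh = result_dict.intern("generated-name")

assert shared.owns(reused)         # known strings still come from the shared dictionary
assert not shared.owns(fresh)      # new strings do not
assert result_dict.owns(fresh)     # they belong to the context's own dictionary
```

In this model, any string that first appears during the transformation is owned by the context's dictionary, which is why the result tree cannot be treated as if it used the shared one.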

Would the developers be open to me suggesting changes to the XSLT codebase to make this possible again? I suppose I should ask on the XSLT list, so let's move on to the real purpose of this mail.

Exploring these issues made me conclude that it's time to at least look at the alternative to sharing a single global dictionary: redicting parts of trees. A redicting operation would take place whenever a node is moved into a new tree. All the strings in the subtree below this node would be traced to the originating document's dictionary, and the entries would be copied into the target document's dictionary. Additionally, all string references in the subtree would be made to point to the new document's dictionary.

Redicting is a potentially expensive operation, but I think that for lxml's purposes it may be worth it to incur this cost. I also have hopes that there are still ways to optimize this, so that redicting doesn't have to happen in all cases -- after all, dictionary sharing works and pools of documents might end up sharing a single dictionary.
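As a rough sketch of what I have in mind, here is the redicting operation in a toy Python model, including the shortcut for documents that share a dictionary. None of this is libxml2 API: `Entry`, `StringDict`, `Document`, `Node` and `move` are hypothetical names, and the model assumes that only element names live in the dictionary, which is exactly the part I'm unsure about.

```python
class Entry:
    def __init__(self, text):
        self.text = text


class StringDict:
    def __init__(self):
        self._entries = {}

    def intern(self, text):
        if text not in self._entries:
            self._entries[text] = Entry(text)
        return self._entries[text]

    def owns(self, entry):
        return self._entries.get(entry.text) is entry


class Document:
    def __init__(self, string_dict=None):
        # Documents can share a dictionary by passing the same one in.
        self.dict = string_dict if string_dict is not None else StringDict()


class Node:
    def __init__(self, doc, name):
        self.doc = doc
        self.name = doc.dict.intern(name)  # assumption: only names are interned
        self.children = []


def move(node, new_parent):
    """Attach node under new_parent; return True if redicting was needed.

    Detaching the node from its old parent is omitted in this sketch.
    """
    target = new_parent.doc
    needs_redict = node.doc.dict is not target.dict
    stack = [node]
    while stack:
        cur = stack.pop()
        if needs_redict:
            # Copy the string into the target dictionary and repoint the node.
            cur.name = target.dict.intern(cur.name.text)
        cur.doc = target
        stack.extend(cur.children)
    new_parent.children.append(node)
    return needs_redict


src, dst = Document(), Document()
root = Node(dst, "root")
section = Node(src, "section")
section.children.append(Node(src, "para"))

assert move(section, root) is True            # different dictionaries: redict
assert dst.dict.owns(section.name)
assert dst.dict.owns(section.children[0].name)

pooled = Document(dst.dict)                   # shares dst's dictionary
note = Node(pooled, "note")
assert move(note, root) is False              # same dictionary: no redicting
```

The cost is one walk over the moved subtree, and the walk only has to touch the strings when the two documents do not already share a dictionary.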

In order to write a good redicting operation, I'd need a bit more information about exactly which information in a tree can end up in a dictionary. If someone were able to give me a list of what ends up in the dictionary, that would be extremely helpful.

Thank you,

Martijn






