Re: [xml] redicting parts of trees
- From: Daniel Veillard <veillard redhat com>
- To: Martijn Faassen <faassen infrae com>
- Cc: xml gnome org
- Subject: Re: [xml] redicting parts of trees
- Date: Sat, 14 May 2005 04:57:02 -0400
On Fri, May 13, 2005 at 11:15:08PM +0200, Martijn Faassen wrote:
Hi there,
libxml2 documents use dictionaries to hash commonly used strings in XML
documents, for reasons of space and time efficiency.
Unfortunately this makes it impossible to safely move a node from one
tree to another.
In the process of writing lxml, an alternative Python wrapper for
libxml2 that among other things aims at automatic memory management,
I've so far worked around this issue by sharing dictionaries between all
trees, whether they be created through reading in XMl, or created from
scratch.
This works, though with the disadvantage of an endlessly growing
dictionary as more and more trees are read in, and potential threading
issues.
I recently discovered another disadvantage: if I understand it
correctly, XSLT processing, in xsltNewTransformContext, creates a new
dictionary in its context, with the style's signature as the subdictionary.
Up to there I agree, but I don't understand "style's signature".
If you meant the "style's dictionnary", then yes.
The style's signature is based on the style document's signature.
The style document's signature in lxml is of course the single shared
dictionary, but unfortunately it seems to be unavoidable (except by
rewriting xsltNewTransformContext itself) to use the shared dictionary
for the result of XSLT processing. This means that nodes from the tree
resulting from the XSLT processing cannot be safely moved to other trees
which do use the single shared dictionary.
The reason is that libxslt makes the garantee that a compiled stylesheet
is read-only when used to make a transformation. It also avoid problems
of shared resources in multi-threaded XSLT engines.
Would the developers be open to me suggesting changes to the XSLT
codebase to make this possible again? I suppose I should ask on the XSLT
Yes that should be doable. I'm not sure what would be the best API for this.
list, so let's move on to the real purpose of this mail.
Exploring these issues made me conclude that it's time to at least look
at the alternative to sharing a single global dictionary, redicting
parts of trees. A redicting operation would take place whenever a node
is moved into a new tree. All the strings in the subtree below this node
will be traced to the originating document's dictionary, and the entries
will be copied into the target document's dictionary. Additionally, all
string references in the subtree will be made to point to the new
document's dictionary.
Yes, it seems that at the DOM level this operation is called an import
based on some PHP/javascript examples I saw recently. I think that if we
add this then we should try to match the existing semantic of those
operation in PHP for example. The thing which need to be checked
when preparing for such an import are:
- doc remapping
- dictionnary remapping
- namespace references to the original document
- namespace remapping to the local document
- entities reference to the original document
I think those are the only pointers which are added to the pure tree
oriented parent/child/sibling ones.
Looking at the import implementation of PHP5 might give us an idea
of how to implement this. Note that there are incomplete APIs
dealing either just with document pointers (xmlSetTreeDoc) or just
namespaces (xmlNewReconciliedNs and xmlReconciliateNs).
Redicting is a potentially expensive operation, but I think that for
lxml's purposes, it's may be worth it to incur this cost. I also have
hopes that there are still ways to optimize this, so that redicting
doesn't have to happen in all cases -- after all, dictionary sharing
works and pools of documents might end up sharing a single dictionary.
yes, agreed. A lot of heuristics can be used to avoid some of the costs
if used in a systematic fashion.
In order to write a good redicting operation, I'd need a bit more
information about which information in a tree exactly can end up in a
dictionary. If someone would be able to give me a list of what ends up
in the dictionary, that would be extremely helpful.
All markup names, all namespaces strings (prefix and namespace names)
and some text node content (so that all "formatting nodes" used to indent
share as much as possible, or very short text nodes for example "0" or "1").
Hope it helps, and thanks !
Daniel
P.S.: I think I should be able to design a method to make importing strings
from a given dictionnary into python strings quite faster for repeatedly
querying the same set of strings. The principle would be to add an
API to the dictionnary returning an index for the string (cost O(1))
and at the python binding level have an array keeping pointers to
the strings already converted (Py_INCREF'ed of course).
--
Daniel Veillard | Red Hat Desktop team http://redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
[
Date Prev][
Date Next] [
Thread Prev][
Thread Next]
[
Thread Index]
[
Date Index]
[
Author Index]