Re: [xslt] Replacing the dict used by the transform context



Salut Daniel,

Daniel Veillard :
> On Sun, Aug 06, 2006 at 10:51:54PM +0200, Stefan Behnel wrote:
>> In lxml, we use a global per-thread dictionary for all parsers. Recently, we
>> had reports about crashes in the XSLT code when moving subtrees of results to
>> other trees. The reason for this is that libxslt creates a new sub-dictionary
>> for each transformation context that is inherited by the result document. This
>> means that the dict disappears when the document is freed. The cleanup
>> procedure for this specific document and all of its elements therefore depends
>> on the source of the original document, and it's difficult to deal with this
>> in generic cleanup code (as required by a wrapper like lxml).
>>
>> We would like to use the same approach as for parsers here, i.e. replace the
>> dict after creating the context. However, in the xsltNewTransformContext
>> function in transform.c it says right before the creation of the sub-dictionary:
>>
>>    /*
>>     * setup of the dictionnary must be done early as some of the
>>     * processing later like key handling may need it.
>>     */
>>
>> Now, I checked if this is actually true and found that the dict is not
>> accessed anywhere in that function or in any of the functions it calls. So
> 
>   The question is how did you check ?

Well, I looked through the function and then looked through all functions it
called to see if any of them contains the word "dict". And they don't. See
below on why that wasn't enough.


> I think this comment was added when
> dealing with extension functions used in keys. Now whether this was about doing
> it in that function (which is early on transformation wise) or early in the
> function (which is what you seems to have concluded) is unclear to me.

I assume you mean: "early in the function" vs. "early in the transformation
process".

At least, it's not in the function, but still, the comment is just completely
ambiguous (or maybe just outdated like - uhm - some others). A misleading
comment is worse than none.


>> this comment is at least misleading.
>>
>> I do not consider it common usage to replace the dict, so I would not request
>> an explicit API for this. However, for the case we need, it is sufficient to
>> free the dict that was created and replace it be the 'right' one.
> 
>   In general replacing the dict with a random new one may simply break the
> processor, there is a number of places where string comparison have been
> replaced with pointer comparison because we knew that a string found in the
> parent dict would be the same as the string if queried in the subdict.

That's good to know. So, what does that mean? Currently, libxslt generated
documents inherit the dictionary of the stylesheet. You say that it is
required that the transform dictionary can access the entries in the
stylesheet dictionary?


> In
> your approach in lxml it will still work because it is the same dictionnary.

Not necessarily. We use per-thread dictionaries, so according to your above
quote, sharing an XSLT between threads will not work if we use the
thread-local dict for the transformation, right?

I tried that with a simple XSLT and it worked for me, so maybe you could try
to explain in a little more detail in what case this will not work. Is it also
related to extension functions (which I did not try)?


> However the big problem of your approach as I explained to Martijn when he
> started lxml is that as a result the dictionnary can never be freed and 
> cannot be shrinked either, so you have a ever balooning resource which will
> exhaust memory after a while no matter what.

Sure, if the parsed XML is sufficiently diverse. However, as long as you stick
to a small set of XML languages (which is the most common use case, I'd say),
you should not run into too much trouble, right?


>> I would like to propose making it a feature that the dictionary is *not*
>> changed in the xsltNewTransformContext, but can be safely replaced in user
>> code right after returning the new context. The attached patch does this. It
>> does not change the behaviour of the current libxslt at all, it mainly moves
>> the dict creation to the end of the function and adds a comment that it is
>> supposed to stay there to prevent any updates before returning the context. I
>> ran "make tests" with it and the tests passed nicely.
> 
>    The main question here is about extension functions, their initialization
> passes the transformation context, which itself includes a pointer to the
> dictionnary. So the dictionnary is available within the course of the
> execution of that function to extension functions.

Ok, after a little thinking and code reading, I managed to figure out what
your last paragraph was supposed to mean. You were referring to the module
extension initialisation in "xsltInitCtxtExt", right? Ok, we don't have that
problem in lxml.


> I guess that's where
> your hope of never touching the dictionnary at that point breaks, of course
> you need a very specific stylesheet and environment but people do very crazy
> things with extension functions.

Sure, replacing the dictionary is only meant for people "who know what they
are doing". That's why I'm trying to fix certain semantics for this case, to
actually allow people to know what impact this has.


> And I cannot garantee:
>    - that your change would not break an existing deployed use of libxslt
>      the dictionnary may be accessed by extension function initialization.

Ok, I understand that now, so my patch is not worth including.


>    - that if not changing it you may never face a case where some stylesheet
>      actually access that dictionnary during the initialization phase.

All I'd like to have is some official documentation saying:

* You should not normally replace the transform dictionary.
* If you do, you must take care on your own that the extension module setup
  you use does not alter the dictionary. Apart from that, it's safe and will
  continue to be safe in later versions.

Note that replacing the dict was never allowed by libxslt (and even officially
discouraged by the comment), so if this semantic change breaks code, it's not
libxslt that's broken.


> However when xsltNewTransformContext returns you should be able to check
> if that dictionnary had been used by comparing xmlDictSize() of style->dict
> and xmlDictSize() of cur->dict. If they are the same, then no string has
> been allocated from that subdict, the only risk would be a dandling pointer
> to it in some extension function structures.

Well, but if it's not the same size, we end up having the same problem as
before. Sure, one could raise an exception telling the user that what she did
back then, when registering that long forgotten extension function setup, was
a no-no. But then, that's exposing implementation details (much better than a
crash, but well)...

Anyway, as I said, we don't have that problem in lxml.


> But maybe that discussion about extension function is moot anyway because in
> lxml context you completely control those extensions and in that case access
> to the dictionary at initialization is impossible, and I guess you would be
> safe running with the current existing code.

I think so, too. That's why I would like to have an official blessing that
this will continue to work in later versions.

Stefan


[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]