Re: [xml] Relaxed entities encoding for html output



On Tue, Sep 04, 2012 at 06:36:12PM +0200, rbondue ext orange com wrote:
Hello,
I am working on a project where we are using either libxslt or xalan for xslt transformations.
We have internally deprecated xalan because libxslt is considerably faster, and all other xml processing is 
performed by libxml2.
We would now like to drop xalan completely, but there is one important case where the two libraries 
produce different output, which prevents us from doing so.

Consider any XML file and the following stylesheet:


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version='1.0'>
<xsl:output method="html"/>
<xsl:variable name="apache">&lt;!--apache-stuff--></xsl:variable>
<xsl:variable name="script">&amp;{My script};</xsl:variable>

<xsl:template match="/">
    <a href="{$apache}/page.html" onMouseUp="{$script}">link</a>
</xsl:template>

</xsl:stylesheet>



libxml2/libxslt currently produce the following file from the transformation:

<a href="&lt;!--apache-stuff--&gt;/page.html" onMouseUp="&amp;{My script};">link</a>


Xerces/Xalan produce:

<a href="<!--apache-stuff-->/page.html" onMouseUp="&{My script};">link</a>


The <!--apache-stuff--> part is supposed to be replaced by the web server for load-balancing purposes, but 
this does not happen when using libxslt because of the escaping (&lt; &gt;). 
That is the issue we are running into.

I have tracked it down, and the problem lies within libxml2, not libxslt (which is why I am posting on this 
list), when the node tree is serialized to text. The enclosed patches fix this and also 
implement a TODO that you had in the code:

The html output method should not escape a & character occurring in an attribute value immediately followed 
by a { character (see Section B.7.1 of the HTML 4.0 Recommendation).

This is illustrated by the &{My script} part in the example above.
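
For illustration only, the rule quoted above can be sketched in a few lines of Python. This is not the code from the enclosed patches or from libxml2; the function name escape_html_attr is hypothetical, and escaping '>' and '"' alongside '<' and '&' is an assumption made for the sketch. It shows only the one-character lookahead that leaves '&' alone when '{' follows.

```python
def escape_html_attr(value: str) -> str:
    """Escape an HTML attribute value, leaving '&' alone when '{' follows.

    Hypothetical helper for illustration; not a libxml2 API.
    """
    out = []
    for i, ch in enumerate(value):
        if ch == "&":
            # One-character lookahead; slicing avoids an IndexError at the end.
            out.append("&" if value[i + 1:i + 2] == "{" else "&amp;")
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")
        elif ch == '"':
            out.append("&quot;")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(escape_html_attr("&{My script};"))
    print(escape_html_attr("a & b"))
```

Note that this lookahead alone does nothing for the <!--apache-stuff--> case, which is still escaped.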

To get back to my issue, however: I am not completely sure which behavior is actually correct, as I could 
not find out whether '<' and '>' are allowed in attribute values in HTML (I know '<' is forbidden in XML).
I ran the regression tests, but they added to my confusion.
Some HTML tests now fail in the test suite (runtest), but if I run:
./testHTML test/HTML/lt.html
then the output is a lot closer to the input file test/HTML/lt.html, which was not the case before, so 
this may be an improvement.
If this is indeed correct, I am of course open to any suggestions or comments you may have about the patches. 
They should apply cleanly to the git trunk.

  Your approach is way too heavy. Instead of changing < and & in all
cases, detecting the full construct first and then special-processing
those cases is far less disruptive. With that approach no other
test case in libxml2 or libxslt fails. So I committed that restricted
approach, which should handle the cases you raise.

  http://git.gnome.org/browse/libxml2/commit/?id=7d4c529a334845621e2f805c8ed0e154b3350cec
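
  As a rough illustration of the "detect the full construct first" idea, here is a minimal Python sketch. It is not the committed C code: the names escape_html_attr_relaxed and _ESCAPES are hypothetical, and the set of characters escaped in the fallback path is an assumption. A complete <!--...--> or &{...} run is copied verbatim; an unterminated one falls through to ordinary escaping, which is why other inputs are unaffected.

```python
# Ordinary escaping applied when no complete construct is found (assumed set).
_ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;"}


def escape_html_attr_relaxed(value: str) -> str:
    """Escape an attribute value, copying complete special constructs as-is.

    Hypothetical helper for illustration; not a libxml2 API.
    """
    out = []
    i = 0
    while i < len(value):
        # Server-side comment: only special if the closing "-->" exists.
        if value.startswith("<!--", i):
            end = value.find("-->", i + 4)
            if end != -1:
                out.append(value[i:end + 3])
                i = end + 3
                continue
        # HTML 4 script macro: only special if the closing "}" exists.
        if value.startswith("&{", i):
            end = value.find("}", i + 2)
            if end != -1:
                out.append(value[i:end + 1])
                i = end + 1
                continue
        # Anything else, including unterminated constructs, is escaped as usual.
        out.append(_ESCAPES.get(value[i], value[i]))
        i += 1
    return "".join(out)


if __name__ == "__main__":
    print(escape_html_attr_relaxed("<!--apache-stuff-->/page.html"))
    print(escape_html_attr_relaxed("&{My script};"))
    print(escape_html_attr_relaxed("a < b & c"))
```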

thinkpad:~/XSLT -> xsltproc/xsltproc orange.xsl orange.xsl
<a href="<!--apache-stuff-->/page.html" onMouseUp="&{My script};">link</a>
thinkpad:~/XSLT -> 

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/


