[xml] Relaxed entities encoding for html output

I am working on a project that uses either libxslt or Xalan for XSLT transformations.
We have internally deprecated Xalan because libxslt is considerably faster and all our other XML processing is 
already done with libxml2.
We would now like to drop Xalan completely, but there is one important case where the two libraries 
produce different output, which prevents us from doing so.

Consider any XML file and the following stylesheet:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version='1.0'>
<xsl:output method="html"/>
<xsl:variable name="apache">&lt;!--apache-stuff--></xsl:variable>
<xsl:variable name="script">&amp;{My script};</xsl:variable>

<xsl:template match="/">
    <a href="{$apache}/page.html" onMouseUp="{$script}">link</a>
</xsl:template>
</xsl:stylesheet>

libxml2/libxslt currently produce the following file from the transformation:

<a href="&lt;!--apache-stuff--&gt;/page.html" onMouseUp="&amp;{My script};">link</a>

Xerces/Xalan produce:

<a href="<!--apache-stuff-->/page.html" onMouseUp="&{My script};">link</a>

The <!--apache-stuff--> part is supposed to be replaced by the web server for load-balancing purposes, but 
this does not happen with libxslt because the angle brackets are escaped (&lt; and &gt;).
That is the issue we are running into.

I have tracked it down, and the problem lies in libxml2 rather than libxslt (which is why I am posting to this 
list): it occurs when the node tree is serialized to text. The enclosed patches fix this and also 
implement a TODO that was already in the code:

The html output method should not escape a & character occurring in an attribute value immediately followed 
by a { character (see Section B.7.1 of the HTML 4.0 Recommendation).

This is illustrated by the &{My script} part in the example above.
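
To make the intended behavior concrete, here is a minimal sketch in Python. It is not the C code from the patches, and the function name and the escape_angle_brackets flag are hypothetical names chosen only for this illustration. It assumes a serializer that escapes attribute values character by character, leaves '&' alone when it is immediately followed by '{', and can optionally leave '<' and '>' unescaped:

```python
def escape_html_attr(value: str, escape_angle_brackets: bool = True) -> str:
    """Escape an attribute value for HTML output (illustrative sketch only).

    escape_angle_brackets is a hypothetical switch: True mimics the
    escaped output shown for libxslt above, False mimics the Xalan output.
    """
    out = []
    for i, ch in enumerate(value):
        if ch == "&":
            # Keep '&' literal when '{' follows, per the rule quoted above.
            if value[i + 1 : i + 2] == "{":
                out.append("&")
            else:
                out.append("&amp;")
        elif ch == '"':
            # The value is written inside double quotes, so escape them.
            out.append("&quot;")
        elif ch == "<" and escape_angle_brackets:
            out.append("&lt;")
        elif ch == ">" and escape_angle_brackets:
            out.append("&gt;")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    href = "<!--apache-stuff-->/page.html"
    print(escape_html_attr(href))                               # escaped form
    print(escape_html_attr(href, escape_angle_brackets=False))  # relaxed form
    print(escape_html_attr("&{My script};"))                    # '&{' kept as-is
```

With the flag set to False, the two attribute values from the example come out the way Xalan writes them; with the default, the href comes out the way libxslt currently writes it.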

Coming back to my issue, however, I am not completely sure which behavior is actually correct, as I could not 
find out whether '<' and '>' are allowed in attribute values in HTML (I know '<' is forbidden in XML).
I ran the regression tests, but they added to my confusion:
some HTML tests now fail in the test suite (runtest), but if I run:
./testHTML test/HTML/lt.html
then the output is much closer to the input file test/HTML/lt.html than it was before, so this 
may be an improvement.
If the new behavior is indeed correct, I am of course open to any suggestions or comments you may have about the 
patches. They should apply cleanly to the git trunk.

Thank you for your work on the libraries :)

P.S: timsort.h is missing from the downloadable hourly git snapshot: libxml2-git-snapshot.tar.gz


Attachments:
entities.c.patch
entities.h.patch
HTMLtree.c.patch
tree.c.patch
tree.h.patch
