Re: [xslt] xsl:output/@encoding may produce character references in element and attribute names

Daniel Veillard wrote:
On Fri, Jul 04, 2008 at 12:57:40PM +0200, Michael Ludwig wrote:
I stumbled upon an oddity in LibXSLT: Element and attribute names end
up containing character references in the output when the characters
are not available in the selected output encoding.

This oddity is actually a bug, so I reported it here:

  You ask for something impossible.

That's true, but when asking I didn't know it was impossible :-)

You get a non-xml document instead of getting an immediate failure.
  It's a trade-off, unrelated to libxslt, it's actually in libxml2.
The transcoding is done on a preserialized UTF-8 document (or document
fragment). Detecting the error would mean that each time a character is
not serializable in the target encoding, when issuing the escaped
sequence, one has to do a rewind lookup and try to guess (it's guessing
because at that point you're manipulating strings, there is no notion of
document structure) whether you're within markup or within content.

That's definitely too late in order to detect the error.
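
To make the point concrete, here is a small sketch in Python rather than C. It is not libxml2's code; it only mimics the behaviour described above by applying character-reference escaping to an already serialized string, using Python's built-in `xmlcharrefreplace` error handler. The sample document and the helper `well_formed` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element and attribute names contain U+00E9 (e with acute),
# which has no representation in US-ASCII.
serialized = '<donn\u00e9es cl\u00e9="\u00e9">\u00e9</donn\u00e9es>'

# String-level escaping: every unencodable character becomes a numeric
# character reference, no matter whether it sits in markup or in content.
escaped = serialized.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(escaped)
print(well_formed(serialized), well_formed(escaped))
```

The references inside the attribute value and the text content are fine, but the ones that land inside the element and attribute names make the result unparsable, which is the non-XML output discussed in this thread.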

  Basically it makes everybody pay a rather high cost for the few who
asked for something impossible.

I, for one, won't ask again :-)

  The current state is there since the beginning of libxml2 (nearly a
decade) so the bug is extremely uncommon. This makes me even less
comfortable with the expansion of the cost. Again, it's a trade-off, a
conscious one. For more information see libxml2 encoding.c around line
2057; that's where the escaping is done. If you see another way to
handle this without heavily penalizing the normal process, I'm all for
fixing this. But right now I don't see a solution.

I don't know if that's viable, or efficient, and chances are high it
isn't as I've had hardly any exposure to C programming, but anyway
here's what I'm thinking after reading your reply:

(1) As far as LibXSLT is concerned, the only way this error may ever
occur is when an output encoding other than UTF-8, UTF-16 or UTF-32 is
specified, so actually when a character repertoire other than Unicode is
selected.

(2) If this is the case, element and attribute names are to be checked
when building the node tree to ensure they can be represented in the
selected encoding. If not, a flag is set on the structure representing
the element or attribute node. (Just making wild guesses - it might be
out of the question to introduce such a thing in a working library, both
for maintenance and performance reasons.) In addition, on encountering
the first such element or attribute node, some flag is set (or some counter
is incremented) to signal that there are indeed nodes with inadmissible
names in the document and extra care has to be taken when transforming
the document.

(3) If this flag, or counter, is set, an extra check is performed during
the transformation phase whenever an element or attribute node is copied
to the output to ensure the name is representable in the selected
encoding. If not so, a run-time error is reported and the transformation
is aborted.
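
The three steps above could be sketched roughly as follows. This is Python with the standard library's ElementTree, not LibXSLT, and it collapses the flag and counter bookkeeping into a single walk over the result tree before serialization. All function names here (`is_unicode_encoding`, `representable`, `check_names`) are hypothetical, and the sketch ignores namespaces and performance entirely.

```python
import codecs
import xml.etree.ElementTree as ET


def is_unicode_encoding(encoding):
    """Step (1): UTF-* encodings cover all of Unicode, so no check is needed."""
    return codecs.lookup(encoding).name.startswith("utf-")


def representable(name, encoding):
    """Return True if every character of the name exists in the encoding."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def check_names(root, encoding):
    """Steps (2) and (3): abort if a name cannot be written in the encoding."""
    if is_unicode_encoding(encoding):
        return
    for elem in root.iter():
        if not isinstance(elem.tag, str):
            continue  # comments and processing instructions have no name
        for name in [elem.tag, *elem.attrib]:
            if not representable(name, encoding):
                raise ValueError(
                    f"name {name!r} is not representable in {encoding}"
                )


# Hypothetical result tree whose names contain U+00E9.
root = ET.fromstring('<donn\u00e9es cl\u00e9="x"/>')
check_names(root, "utf-8")       # passes: step (1) short-circuits
check_names(root, "iso-8859-1")  # passes: U+00E9 exists in Latin-1
try:
    check_names(root, "ascii")
except ValueError as err:
    print("transformation aborted:", err)
```

Checking on the tree rather than on the serialized string is the whole point: at this stage it is still known which strings are names and which are content.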

Thanks for reading this. It is just naive speculation on my part without
any knowledge of how LibXSLT actually works.

In addition, the oddity I've encountered may also arise in other
situations that do not have anything to do with LibXSLT. I don't know
if this is the case. If so, however, a more general solution might be
desirable, and this would probably become more complicated.

Thanks anyway for a great XML/XSLT library, and thanks for still caring
after all these years.

Michael Ludwig
