[xslt] xsltproc and UTF-8 multi-byte



Greetings folks,

I'm trying to use xsltproc on Solaris 8 to transform an XML file of
UTF-8 multi-byte text.  In the resulting output, instead of UTF-8, I see
numeric character references, which I think are the equivalent Unicode
code points (UCS-2 values).  For example, a line starts with
     <para>\343\200\200\343\200\201...</para>
and becomes
     <para>&#x3000;&#x3001;...</para>
(The source is really entered as six real bytes, not a string of six
escaped octals as shown above.  If you cat the file in C locale, that
text is thoroughly unreadable.)

(For those who are multi-byte conversant, these codepoints were
taken from GB-2312 (Simplified Chinese EUC) codeset 0xa1a1 and
0xa1a2.  They are the first two multi-byte codepoints.)
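
(A quick way to check that the two forms really name the same
characters is to decode the bytes by hand.  The following is only a
sketch in Python, not part of the test sample; the variable names are
made up for illustration, and it assumes the six bytes are exactly the
ones shown above.)

```python
# The six bytes from the sample line, written as octal escapes.
raw = b"\343\200\200\343\200\201"

# Decode them as UTF-8 and list the resulting Unicode code points.
text = raw.decode("utf-8")
codepoints = [f"U+{ord(ch):04X}" for ch in text]

# Rebuild the hexadecimal character references seen in the output.
refs = "".join(f"&#x{ord(ch):X};" for ch in text)

# The same two characters, starting from the GB-2312 (EUC) bytes.
gb_text = bytes([0xA1, 0xA1, 0xA1, 0xA2]).decode("gb2312")

print(codepoints)
print(refs)
print(gb_text == text)
```

If the assumption holds, the references in the output and the UTF-8
bytes in the input are two spellings of the same two characters, so
the transformed text is equivalent and only its serialization differs.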

I would like to just see the original UTF-8 text (in its transformed
XML, of course).  I suspect this ought to be really easy, but I'm
completely missing it.  Any suggestions?

Given the encoding of the test sample, I can't show it in this mail.
But I can send it as a tar.gz attachment to anyone who is interested.
I consider it small (30 lines) and reasonably self-documenting.

--Rick Kwan



