[xslt] xsltproc and UTF-8 multi-byte

From: Rick Kwan <kwanrj02 lightsaber com>
To: xslt gnome org
Cc: Rick Kwan <kwanrj02 lightsaber com>
Subject: [xslt] xsltproc and UTF-8 multi-byte
Date: Tue, 26 Nov 2002 10:08:31 -0800

Greetings folks,

I'm trying to use xsltproc on Solaris 8 to transform an XML file of
UTF-8 multi-byte text.  In the resulting output, instead of UTF-8, I
see numeric character references, which I think are the equivalent
Unicode UCS-2 values.  For example, a line that starts with
     <para>\343\200\200\343\200\201...</para>
becomes
     <para>&#x3000;&#x3001;...</para>
(The source is really entered as six real bytes, not a string of six
escaped octals as shown above.  If you cat the file in C locale, that
text is thoroughly unreadable.)

(For those who are multi-byte conversant, these codepoints were
taken from GB-2312 (Simplified Chinese EUC) codeset 0xa1a1 and
0xa1a2.  They are the first two multi-byte codepoints.)
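As a sanity check on the byte values quoted above, here is a minimal
Python sketch (not part of the original test sample; the variable names
are only illustrative).  It assumes a Python 3 interpreter with the
standard gb2312 codec, decodes the six bytes as UTF-8, and prints both
the numeric character references and the GB-2312 bytes:

```python
# The six raw bytes from the sample, written with the same octal escapes
# used in the message (these are real bytes, not a 24-character string).
raw = b"\343\200\200\343\200\201"

# Decode as UTF-8: six bytes become two Unicode characters.
text = raw.decode("utf-8")

# Build the numeric character references that appear in the output.
refs = "".join(f"&#x{ord(ch):X};" for ch in text)

# Re-encode as GB-2312 (EUC-CN) to compare with 0xa1a1 and 0xa1a2.
gb = text.encode("gb2312")

print(refs)      # &#x3000;&#x3001;
print(gb.hex())  # a1a1a1a2
```

So the references in the output name the same two characters as the
UTF-8 bytes in the input; only the serialization differs.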

I would like to see the original UTF-8 text itself (within the
transformed XML, of course).  I suspect this ought to be really easy,
but I'm completely missing it.  Any suggestions?

Given the encoding of the test sample, I can't show it in this mail,
but I can send it as a tar.gz attachment to anyone who is interested.
It is small (30 lines) and reasonably self-documenting.

--Rick Kwan

Follow-Ups:
- Re: [xslt] xsltproc and UTF-8 multi-byte
  - From: William M. Brack
