Re: [xslt] xsltproc changes unicode to nonsense



Paul Tremblay writes:
> On Fri, Nov 08, 2002 at 09:34:47AM +0100, Morus Walter wrote:
> > Paul Tremblay writes:
> > > In my stylesheet, I have:
> > > 
> > > <xsl:output method="html"/>
> > > 
> > > I then put this character in my xslt stylesheet:
> > > 
> > > &#x00A0;
> > > 
> > > This should be a no break space.
> > > 
> > > However, xsltproc translates this to an upper case A with a hat over it.
> > > If I change the output method line to:
> > > 
> > No. It outputs a nonbreaking space in utf8, which (read as latin1) is
> > an upper case A with a hat over it, followed by a nonbreaking space.
> > The latter is a bit hard to see but it's there.
> > 
> 
> Wow. I'm really confused. What does the A with the hat over it have to
> do with a non-breaking space? 
> 
> I thought I could pick any unicode character, and a browser would have
> to represent it. I understand that not all browsers can handle every
> single unicode character, but I thought that if a browser couldn't
> handle a character, it would output a "?". 
> 
> I guess I don't understand utf8. I thought that utf8 *was* unicode. That
> is, it was a way to represent all of unicode with just 8-bit numbers.
> (Now that I think of it, even 8-bit should be wrong, since not all
> computers agree on the upper 128 in character set.) 
> 
No.
Unicode is a convention for mapping characters to numbers. E.g. #160 is
nonbreaking space, #65 is 'A'.
UTF8 is a convention for storing these numbers in a file.
So unicode and UTF8 are connected but not the same.
UTF8 doesn't simply store the number, it encodes the number into variable
length byte sequences (numbers 0-127 take one byte, 128-2047 two bytes and
so on).
Now when you do this with #160 you get two bytes (xC2 and xA0) and
when you *interpret* these bytes as latin1 you get the characters
named above (of course this is a misinterpretation, since you intended these
bytes to be utf8). So this is what you see in a latin1 console window or
in a browser expecting latin1. If you see this in a browser, you
can switch the encoding to utf8 and the browser will display
your non breaking space (in mozilla this is 'view' -> 'character coding').
So in your case, part of the problem might be that you have to tell the
browser that it's getting utf8.
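
The round trip described above can be reproduced in a few lines. This is
only a sketch, assuming a Python 3 interpreter (Python is not part of the
original thread); the variable names are made up for illustration.

```python
# U+00A0 (decimal 160) is the Unicode number for NO-BREAK SPACE.
nbsp = "\u00a0"

# UTF-8 turns that number into a variable-length byte sequence.
utf8_bytes = nbsp.encode("utf-8")
print(utf8_bytes)  # b'\xc2\xa0'

# Interpreting the same two bytes as latin1 gives two characters:
# U+00C2 (A with circumflex) followed by U+00A0 (no-break space).
misread = utf8_bytes.decode("latin-1")
print([hex(ord(c)) for c in misread])  # ['0xc2', '0xa0']

# Interpreting them as UTF-8 gives back the single original character.
print(utf8_bytes.decode("utf-8") == nbsp)  # True

# The encoded length grows with the size of the number.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800):
    print(hex(cp), len(chr(cp).encode("utf-8")))
```

The last loop shows the length boundaries: one byte up to 0x7F (127),
two bytes up to 0x7FF (2047), three bytes from 0x800 (2048).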

You should always keep in mind that any program just sees some bytes (containing
numbers) and interprets them before they become characters and readable text.
And when there are different ways to interpret something, you will
see different results.

> Do you know any good sites that explain this?
> 
Have a look at 
http://www.cl.cam.ac.uk/~mgk25/unicode.html
	(UTF-8 and Unicode FAQ for Unix/Linux)
http://czyborra.com/
	(Unicode in the Unix Environment)

On linux you might also have a look at the utf8 man page though this
is very technical.
 
> 
> > Provide an appropriate output encoding (such as ASCII or iso-8859-1)
> > to get '&#160;' or a literal non breaking space.
> 
> I'll have to try this. One thing that really annoys me is that I have a
> linux box, and I always get webpages full of "??" because the webpages
> assumed everyone uses the same encoding scheme. I thought utf8 was a way
> to ensure this wouldn't happen. But I guess I have a thing or two to
> learn!
> 
Well, one source of the ?-problem with linux/unix browsers is that windows
users tend to use a special windows encoding and claim it's latin1.
See 
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
Linux/unix users are usually in no danger of making this error, since
the windows encodings aren't used under linux/unix.
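
To illustrate the mislabelling, here is a small sketch, again assuming
Python 3 and assuming that the "special windows encoding" in question is
windows-1252 (Python's "cp1252" codec); the sample bytes are invented.

```python
# Bytes 0x93 and 0x94 are curly quotes in windows-1252,
# but in latin1 they are invisible C1 control characters.
raw = b"\x93quoted\x94"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

# Show the code points of the first and last character in each reading.
print([hex(ord(c)) for c in (as_cp1252[0], as_cp1252[-1])])  # ['0x201c', '0x201d']
print([hex(ord(c)) for c in (as_latin1[0], as_latin1[-1])])  # ['0x93', '0x94']
```

A browser that takes the latin1 label at face value has no printable
character for 0x93 or 0x94, which is one way to end up with "?".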

The second source is an insufficient encoding declaration.
If you send latin1 and the browser "thinks" it's utf8 or vice versa, you
should not be surprised if things don't work as expected.

The safest way to stay out of trouble is to use ascii and decimal
character references (&#123;).
Specifying the right encoding should of course work as well.
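
As a sketch of the ascii-plus-character-references approach (assuming
Python 3; the sample string is made up), the standard "xmlcharrefreplace"
error handler replaces every non-ascii character with a decimal reference:

```python
# A string containing a no-break space (U+00A0) between two ascii words.
text = "price:\u00a0100"

# Encode to pure ascii; anything outside ascii becomes &#NNN;.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)  # b'price:&#160;100'
```

The result contains only bytes below 128, so it reads the same whether
the receiver assumes ascii, latin1 or utf8.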

HTH
	Morus


