Re: highlighting



On Wed, 12 May 2010 18:29:58 +0100
Robert Pearce <rob bdt-home demon co uk> wrote:

> Because Linux is natively UTF-8 and therefore handles UTF-8 strings.
> MinGW sits on top of Windows, which is UCS-16 - that is, any unicode
> string must use wide characters throughout, so any "normal" string has
> to be translated. The default behaviour of Windows is to assume that
> such traditional strings are CP1252 or some such, and therefore make
> the wrong translation to UCS when presented with UTF-8.

This thread has been full of incorrect information, so please read
these before discussing anything:
http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
http://en.wikipedia.org/wiki/UTF-16
http://en.wikipedia.org/wiki/UTF-8

Windows has been using UTF-16 since win2k (it was using UCS-2 before that).
UTF-16 is compatible with UCS-2 (that is, it can hold any Unicode code point
that can be expressed via UCS-2), but the reverse is not true.

UTF-8 is not compatible with anything except ASCII (IIRC): the 128 ASCII
characters are encoded as the same single bytes.

Any Unicode code point can be expressed in UTF-8 and UTF-16, but not in UCS-2,
which only covers U+0000 to U+FFFF (the Basic Multilingual Plane).

Environments which use UTF-8 (variable-width encoding, 1, 2, 3 or 4 bytes
per code point):
GTK+, most of the web, XML (by default), text-mode Linux (usually).
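
To make the 1 to 4 byte scheme concrete, here is a minimal sketch of a UTF-8
encoder for a single code point. The name encode_utf8 is made up for this
example and is not taken from GTK+ or any other library; real code should
prefer a tested library routine.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (illustration only): encode one Unicode code point
// as UTF-8 and return the resulting bytes.
std::string encode_utf8(std::uint32_t cp) {
    // Surrogates (U+D800..U+DFFF) and values above U+10FFFF cannot be encoded.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not an encodable Unicode code point");

    std::string out;
    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx (identical to ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x41) gives the single byte 41 ('A'), while
encode_utf8(0x20AC) (the euro sign) gives the three bytes E2 82 AC.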

Environments which use UTF-16 (variable-width encoding, 2 or 4 bytes per
code point):
Newer Java, .NET, Qt, Windows (>= 2k).
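
The "2 or 4 bytes" part is where UTF-16 differs from UCS-2. A rough sketch,
assuming a C++11 compiler (for std::u16string); encode_utf16 is a hypothetical
name used only for this illustration:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (illustration only): encode one Unicode code point
// as one or two 16-bit UTF-16 code units.
std::u16string encode_utf16(std::uint32_t cp) {
    // Surrogates (U+D800..U+DFFF) and values above U+10FFFF cannot be encoded.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not an encodable Unicode code point");

    std::u16string out;
    if (cp < 0x10000) {
        // Up to U+FFFF: one code unit, the same value UCS-2 would store.
        out += static_cast<char16_t>(cp);
    } else {
        // Above U+FFFF: split the remaining 20 bits into a surrogate pair.
        // UCS-2 has no equivalent, so it cannot represent these code points.
        std::uint32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 | (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 | (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

With this, U+20AC comes out as the single unit 20AC, and U+1F600 comes out as
the pair D83D DE00.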

Environments which use UCS-2 (fixed-width encoding, 2 bytes per code point):
Older Java, Windows (< 2k).

There's also UTF-32 (same as UCS-4), a fixed-width 4-byte encoding,
but I'm not sure if anyone uses it.
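
For comparison, this is roughly what converting UTF-8 to a fixed-width form
looks like: each element of the result is one whole code point. It is only a
sketch, assuming C++11 (std::u32string). utf8_to_utf32 is a made-up name, and
the code deliberately skips some checks a real decoder needs (it does not
reject overlong forms or encoded surrogates).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (illustration only): decode UTF-8 bytes into UTF-32.
// Simplified: overlong encodings and encoded surrogates are NOT rejected.
std::u32string utf8_to_utf32(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp;
        // The lead byte tells us how long the sequence is.
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else throw std::invalid_argument("invalid lead byte");

        if (i + len > s.size())
            throw std::invalid_argument("truncated sequence");

        // Each continuation byte (10xxxxxx) contributes 6 more bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out += cp;
        i += len;
    }
    return out;
}
```

The trade-off is space: plain ASCII text takes four times as many bytes in
UTF-32 as it does in UTF-8.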


Personally, I like UTF-8 most, since every ASCII file can be read as UTF-8,
and every UTF-8 string can be stored in plain std::string (or 0-terminated char*).
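
One thing to keep in mind when storing UTF-8 in std::string is that size()
counts bytes, not code points. A small sketch of the difference, assuming the
input is well-formed UTF-8; utf8_length is a hypothetical helper name, not a
standard function:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (illustration only): count the code points in a
// well-formed UTF-8 string by counting every byte that is NOT a
// continuation byte (continuation bytes look like 10xxxxxx).
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For pure ASCII such as "hello", size() and utf8_length() agree (both 5). For
"caf\xC3\xA9" (the word "café" in UTF-8), size() is 5 but utf8_length() is 4.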


Cheers,
Alexander

