Re: UTF-8

From: Ole Laursen <olau hardworking dk>
To: gnome-i18n gnome org
Subject: Re: UTF-8
Date: 10 Jul 2002 19:32:45 +0200

Damien Donlon - Sun Ireland - Solaris Software - Localisation Engineer <damien.donlon@sun.com> writes:
>     I think it may be impossible to distinguish between UTF-8 and 8859-1
>     if no character is outside the 0-127 range. Can anyone confirm? Is
>     this a big problem in identifying UTF-8 encoded files?

Yep. And no: if there are no characters above 127, both encodings
decode the file to the same text, so it is really not a problem. :-)

The big problem is to decide whether the bytes above 127 are parts
of UTF-8 multi-byte sequences or just ordinary ISO 8859-1 characters.
This is unsolvable in general, but with some clever coding you could
specialize for a lot of different languages, I suspect. Danish text
contains 'æ', 'ø', and 'å', for instance, so if you spot their UTF-8
encodings, which show up as 'Ã¦', 'Ã¸' and 'Ã¥' when read as
ISO 8859-1, you can be pretty sure that the file is UTF-8 and not
ISO 8859-1. This is how I do it myself. ;-)
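
The Danish heuristic could be sketched roughly like this in Python. It
is only an illustration under a few assumptions: the function name
guess_encoding and the exact letter list are made up for the example,
and the only two candidates considered are UTF-8 and ISO 8859-1.

```python
# Hypothetical helper, not taken from any existing tool.
# Assumption: the text is Danish and is either UTF-8 or ISO 8859-1.
DANISH_LETTERS = "æøåÆØÅ"


def guess_encoding(data: bytes) -> str:
    """Guess 'ascii', 'utf-8' or 'iso-8859-1' for a chunk of Danish text."""
    # No byte above 127: both encodings decode to the same text.
    if all(byte < 128 for byte in data):
        return "ascii"

    # Count occurrences of the two-byte UTF-8 forms of the Danish letters
    # (these are the sequences that look like 'Ã¦', 'Ã¸', 'Ã¥' in Latin-1).
    utf8_hits = sum(
        data.count(letter.encode("utf-8")) for letter in DANISH_LETTERS
    )

    # Any hit is taken as evidence for UTF-8; otherwise fall back to Latin-1.
    return "utf-8" if utf8_hits > 0 else "iso-8859-1"
```

The result can be passed straight to data.decode(). Being a heuristic,
it can guess wrong, e.g. for ISO 8859-1 text that really does contain
'Ã' followed by one of those symbols, or for UTF-8 text whose only
non-ASCII characters are not Danish letters.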

-- 
Ole Laursen
http://sunsite.dk/olau/
