Re: UTF-8



Damien Donlon - Sun Ireland - Solaris Software - Localisation Engineer <damien.donlon@sun.com> writes:
>     I think it may be impossible to distinguish between UTF-8 and 8859-1
>     if no character is outside the 0-127 range. Can anyone confirm? Is
>     this a big problem in identifying UTF-8 encoded files?

Yep. And no - if there are no characters above 127, it doesn't matter
how you interpret the file, so it is really not a problem. :-)
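
A quick sketch of why it doesn't matter, in Python (my choice for
illustration; the sample string and the helper name is_seven_bit are
made up): bytes in the 0-127 range decode to the same text under both
ISO 8859-1 and UTF-8.

```python
# Made-up sample data: every byte is in the 0-127 range.
data = b"plain text, nothing above 127"


def is_seven_bit(raw: bytes) -> bool:
    """Return True if every byte is in the 0-127 range."""
    return all(b < 128 for b in raw)


# For such data the two decodings give exactly the same string.
same = data.decode("latin-1") == data.decode("utf-8")
```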

The big problem is to decide whether the bytes above 127 are part of
a UTF-8 multi-byte sequence or just ordinary ISO 8859-1 characters.
This is in general
unsolvable, but with some clever coding you could specialize for a lot
of different languages, I suspect. Danish text contains 'æ', 'ø', and
'å', for instance, so if you spot the UTF-8 equivalents 'æ', 'ø' and
'Ã¥', you can be pretty sure that it is UTF-8 and not ISO 8859-1. This
is how I do it myself. ;-)
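
Roughly, the Danish trick could be sketched like this in Python. This
is only an illustration under my own assumptions, not the exact code I
use: guess_encoding and the letter list are hypothetical, and the
result is a guess, since ISO 8859-1 text that really contains an 'Ã'
followed by '¦' would fool it.

```python
# Danish letters (ae, o-slash, a-ring, lower and upper case), written as
# escapes so the source file itself stays plain ASCII.
DANISH_LETTERS = "\u00e6\u00f8\u00e5\u00c6\u00d8\u00c5"

# The two-byte UTF-8 sequence for each of those letters.
UTF8_SEQUENCES = [ch.encode("utf-8") for ch in DANISH_LETTERS]


def guess_encoding(raw: bytes) -> str:
    """Guess 'ascii', 'utf-8' or 'iso-8859-1' for Danish text (heuristic)."""
    if all(b < 128 for b in raw):
        # Nothing above 127: the interpretation does not matter.
        return "ascii"
    if any(seq in raw for seq in UTF8_SEQUENCES):
        # Spotted the UTF-8 form of a Danish letter: most likely UTF-8.
        return "utf-8"
    # High bytes, but no known UTF-8 sequence: assume ISO 8859-1.
    return "iso-8859-1"
```

For another language you would swap in that language's common
non-ASCII letters in DANISH_LETTERS.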

-- 
Ole Laursen
http://sunsite.dk/olau/


