Re: UTF-8



Hi,

Damien Donlon - Sun Ireland - Solaris Software - Localisation Engineer  <damien.donlon@sun.com> writes:

> [2] Create a tool that can check whether a file is UTF-8 encoded.
>     The tool should not be dependent on simply reading a charset field
>     within the file to see whether it says UTF-8 but by analysing the
>     byte stream. Does such a tool exist already within the community?
> 
>     I think it may be impossible to distinguish between UTF-8 and 8859-1
>     if no character is outside the 0-127 range. Can anyone confirm? Is
>     this a big problem in identifying UTF-8 encoded files?

this is correct. The 7-bit ASCII encoding, which occupies the 0-127
range of ISO-8859-1 (and others?), is a subset of UTF-8. But I don't
see any problem here, since an ISO-8859-1 encoded file that uses
nothing but characters from the 0-127 range is at the same time a
valid UTF-8 encoded file.
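
The subset relationship is easy to check by hand. The following is a
minimal Python sketch, not part of any existing tool; the helper name
decodes_as and the sample byte strings are made up for illustration.
It shows that pure 0-127 data decodes to the same text under both
encodings, while a typical ISO-8859-1 byte above 127 is rejected by a
strict UTF-8 decoder.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Sample data (illustrative only).
ascii_only = b"plain 7-bit text"
latin1_text = "caf\u00e9".encode("iso-8859-1")  # ends in the single byte 0xE9

# Bytes in the 0-127 range mean the same thing in both encodings.
same_text = ascii_only.decode("utf-8") == ascii_only.decode("iso-8859-1")

# A lone 0xE9 is a UTF-8 lead byte with no continuation bytes after it,
# so the ISO-8859-1 data is not valid UTF-8.
latin1_is_utf8 = decodes_as(latin1_text, "utf-8")
```

Note that the reverse test tells you nothing: every byte sequence
decodes without error as ISO-8859-1, so only the UTF-8 check is useful
for detection.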

The standard 'file' utility seems to do a decent job at detecting
UTF-8 encoded files. It fails to distinguish some other encodings
correctly, but some quick tests I did showed no false positives or
negatives for UTF-8.
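
If a dedicated checker is wanted anyway, analysing the byte stream can
be sketched in a few lines of Python. This is only an illustration
under stated assumptions, not how 'file' works internally: the
function names classify_chunks and classify_file, the three result
strings and the chunk size are all made up here. It uses the standard
library's incremental UTF-8 decoder so that multi-byte sequences split
across read boundaries are handled.

```python
import codecs


def classify_chunks(chunks):
    """Classify an iterable of bytes objects as 'ascii', 'utf-8' or 'not utf-8'."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    saw_high_byte = False
    try:
        for chunk in chunks:
            if any(b > 0x7F for b in chunk):
                saw_high_byte = True
            decoder.decode(chunk)
        # final=True makes a truncated multi-byte sequence at the end an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return "not utf-8"
    # Pure 0-127 data is valid UTF-8 too, but report it separately because
    # it is equally valid ISO-8859-1.
    return "utf-8" if saw_high_byte else "ascii"


def classify_file(path, chunk_size=65536):
    """Apply classify_chunks to the raw bytes of the file at path."""
    with open(path, "rb") as f:
        return classify_chunks(iter(lambda: f.read(chunk_size), b""))
```

A result of "utf-8" means the bytes form valid UTF-8, which is strong
evidence but not proof of the intended encoding, since some byte
sequences are valid in several encodings at once.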


Salut, Sven


