Re: charset code conversion in I/O



> It would be a useful thing to have; however, it is also fairly
> challenging to implement in general.

  I agree. But I imagine that a generic framework can be designed, and
that it would be useful.

> (How often have you seen Netscape get its autodetection wrong?)

  Yes, Netscape sometimes fails to detect the encoding. For example,
ISO-8859-1 encoded pages are sometimes recognized as Shift_JIS. But I
do not think it is a big problem. The feature actually makes my life
easier, because most web servers (and pages) do not send a 'charset='
parameter in the HTTP response (or in a <META HTTP-EQUIV=..> tag in the
HTML header).

> What I thought of doing is having a function that does autodetection,
> but only the most trivial autodetection:
> 
>  - Does it have a BOM identifying it as UTF-16, either
>    endianness?  
>  - Is it UTF-8? (Easy because of the structure of UTF-8.)
>  - If it is not UTF-8, could it be the locale's encoding? 
>    (Does conversion from locale-encoding work?)
>  - If not, give up and return an error. The program can ask
>    the user to pick an encoding.
> 
> I'm not sure about going further and doing frequency counts and
> so forth to try to guess the encoding from that.
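
  The steps quoted above could be sketched roughly as follows. This is
only an illustration in Python, not a proposal for the actual API; the
function name detect_trivial and its locale_encoding parameter are
made up for the example.

```python
import codecs
import locale
from typing import Optional


def detect_trivial(data: bytes, locale_encoding: Optional[str] = None) -> Optional[str]:
    """Return an encoding name, or None when detection gives up.

    Hypothetical helper: the name and signature are assumptions.
    """
    # 1. A BOM identifies UTF-16 in either byte order.
    #    (A UTF-32 LE BOM starts with the same two bytes; this sketch ignores that.)
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"

    # 2. UTF-8 has a strict structure, so a clean strict decode is a strong hint.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # 3. Otherwise, check whether conversion from the locale's encoding works.
    enc = locale_encoding or locale.getpreferredencoding(False)
    try:
        data.decode(enc)
        return enc
    except (UnicodeDecodeError, LookupError):
        # 4. Give up; the program can ask the user to pick an encoding.
        return None
```

Note that pure ASCII input is reported as UTF-8 here, because ASCII is
a subset of UTF-8.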

  If the API is designed to be configurable (pluggable), someone can
write a detector specific to a language (a set of encodings) and call
it via the API.
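
  As a rough sketch of what "pluggable" could mean, here is a small
Python example. All names (register_detector, detect,
japanese_detector) are hypothetical, and the Japanese detector is
deliberately crude: it only tries strict decoding with a few encodings
in a fixed order.

```python
from typing import Callable, Dict, List, Optional

# A detector takes raw bytes and returns an encoding name, or None if unsure.
Detector = Callable[[bytes], Optional[str]]

# Registry of detectors, keyed by a language tag (hypothetical design).
_detectors: Dict[str, List[Detector]] = {}


def register_detector(language: str, detector: Detector) -> None:
    """Plug a language-specific detector into the registry."""
    _detectors.setdefault(language, []).append(detector)


def detect(data: bytes, language: str) -> Optional[str]:
    """Ask each detector registered for the language, in registration order."""
    for detector in _detectors.get(language, []):
        result = detector(data)
        if result is not None:
            return result
    return None


def japanese_detector(data: bytes) -> Optional[str]:
    """Crude example: the first encoding that decodes strictly wins.

    Pure ASCII input is reported as iso2022_jp, since it is tried first.
    """
    for enc in ("iso2022_jp", "euc_jp", "shift_jis"):
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


register_detector("ja", japanese_detector)
```

A caller would then use detect(data, "ja"), and a language with no
registered detector simply yields None.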

> I also don't know what one can do about trying to guess the encoding
> of a stream - the trouble is that you really need to have access to
> the whole file before you can tell if it is say, UTF-8, or latin-1.

  Setting some limit on the length examined would help. If the detector
cannot decide within that range, it can return a default encoding or a
list of encodings with calculated scores.
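
  A minimal sketch of that idea in Python follows. The function name
score_encodings, the default limit of 4096 bytes, and the scoring rule
(the share of characters that decoded without a replacement character)
are all assumptions made for illustration.

```python
import codecs
from typing import List, Tuple


def score_encodings(data: bytes, candidates: List[str],
                    limit: int = 4096) -> List[Tuple[str, float]]:
    """Score candidate encodings using only the first `limit` bytes."""
    prefix = data[:limit]
    results = []
    for enc in candidates:
        # An incremental decoder with final=False holds back a multibyte
        # sequence cut off by the limit instead of counting it as an error.
        decoder = codecs.getincrementaldecoder(enc)(errors="replace")
        text = decoder.decode(prefix, final=False)
        if not text:
            score = 0.0
        else:
            # U+FFFD marks each undecodable position.
            score = 1.0 - text.count("\ufffd") / len(text)
        results.append((enc, score))
    # Highest score first; ties keep the order of `candidates`.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

The caller can fall back to a default encoding when the best score is
too low. A single-byte encoding such as latin-1 always scores 1.0 under
this rule, so it is only useful as a last candidate.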

  KUSANO Takayuki <URL:http://www.asahi-net.or.jp/~AE5T-KSN/>



