Re: charset code conversion in I/O



KUSANO Takayuki <AE5T-KSN asahi-net or jp> writes:

> > I think we'll be adding some more conversion utility functions:
> >  
> >  - locale <=> UTF-8  [ Pretty simple, Robert Brady has a patch ]
> >  - read a file a line at a time, converting on the fly
> >  - read a whole file into a string, converting on the fly
> > 
> > I'm not sure about adding conversion into g_io_channel - I think
> > we'd have to change the interfaces to do better error reporting
> > for one thing, and it would be a fairly major job to get right - 
> > probably too much for Glib-2.0.
> 
>   How about generic interfaces for 'Auto detection' of encodings?
>   Such functions may be useful for some applications, such as
>   web browsers (especially those based on the gtkhtml widget) and text editors.
>   Mozilla has such classes.

It would be a useful thing to have; however, it's also fairly
challenging to implement in general. (How often have you seen Netscape
get its autodetection wrong?)

What I thought of doing is having a function that does autodetection,
but only the most trivial autodetection:

 - Does it have a BOM identifying it as UTF-16, in either
   endianness?
 - Is it UTF-8? (Easy because of the structure of UTF-8.)
 - If it is not UTF-8, could it be the locale's encoding? 
   (Does conversion from locale-encoding work?)
 - If not, give up and return an error. The program can ask
   the user to pick an encoding.
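
In code, that might look something like this - just a rough sketch,
not a proposed API. The enum and function name here are made up, and
I'm assuming the locale <=> UTF-8 conversion mentioned above ends up
with a name like g_locale_to_utf8():

/* Sketch only: trivial encoding autodetection as described above.
 * The GuessedEncoding enum and guess_encoding() are hypothetical. */
#include <glib.h>

typedef enum {
  GUESSED_UTF16_BE,
  GUESSED_UTF16_LE,
  GUESSED_UTF8,
  GUESSED_LOCALE,
  GUESSED_UNKNOWN
} GuessedEncoding;

static GuessedEncoding
guess_encoding (const gchar *data,
                gsize        len)
{
  gchar *converted;

  /* 1. BOM identifying UTF-16, either endianness */
  if (len >= 2)
    {
      if ((guchar) data[0] == 0xFE && (guchar) data[1] == 0xFF)
        return GUESSED_UTF16_BE;
      if ((guchar) data[0] == 0xFF && (guchar) data[1] == 0xFE)
        return GUESSED_UTF16_LE;
    }

  /* 2. Is it valid UTF-8? (Easy because of the structure of UTF-8.) */
  if (g_utf8_validate (data, len, NULL))
    return GUESSED_UTF8;

  /* 3. Could it be the locale's encoding?
   *    (Does conversion from the locale encoding work?) */
  converted = g_locale_to_utf8 (data, len, NULL, NULL, NULL);
  if (converted)
    {
      g_free (converted);
      return GUESSED_LOCALE;
    }

  /* 4. Give up; the caller can ask the user to pick an encoding. */
  return GUESSED_UNKNOWN;
}

Note that step 3 can't really distinguish "is in the locale encoding"
from "happens to convert cleanly" - for single-byte encodings like
Latin-1, any byte sequence will convert - so it's only a guess.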

I'm not sure about going further and doing frequency counts and so
forth to try to guess the encoding from that.

I also don't know what one can do about trying to guess the encoding
of a stream - the trouble is that you really need to have access to
the whole file before you can tell whether it is, say, UTF-8 or Latin-1.

Regards,
                                        Owen




