Re: ustring::validate() costs?



Stephan Puchegger <stephan puchegger univie ac at> writes:

>> It would be nice to figure out in my program for /each/ file I read,
>> in which character set it is encoded. Is this possible? I only found
>> functions so far which can either read the locale's character set or
>> check if some filename is valid UTF-8 (or not), but no function
>> which individually probes for a certain file in which character set
>> its filename is encoded.
>
> I am no expert in "character-string encodings", but I guess that this is 
> not possible, since no string contains information about the actual 
> encoding type. The only thing it contains is the encoded string itself. 
> The encoding type is usually taken from the locale if I am not 
> completely mistaken.

This is an old thread, but: most browsers have an auto-detect option for
guessing the encoding of web pages because web authors are sloppy and
forget to specify what encoding they are using.

With file names, you would only have very little data to base the
guess on, but on the other hand you probably only have to worry about
two encodings, UTF-8 and the encoding of the locale. So it should be
doable. In fact, I think trying UTF-8 and then falling back to the
encoding of the locale if the file name is not valid UTF-8 will get
you through most cases intact, at least for European languages.
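
To make the fallback concrete, here is a rough sketch in plain C++ so it
runs without any library. The names is_valid_utf8, latin1_to_utf8 and
filename_to_utf8 are made up for this example, and ISO-8859-1 is only an
assumed stand-in for "the encoding of the locale". In a glibmm program
you would more likely check with Glib::ustring::validate() and convert
with Glib::locale_to_utf8() instead of hand-rolling either step.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8. Rejects overlong forms,
// UTF-16 surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) {  // plain ASCII byte
            ++i;
            continue;
        }
        std::size_t len = 0;   // total bytes in this sequence
        unsigned long cp = 0;  // decoded code point
        unsigned long min = 0; // smallest code point allowed for this length
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false;           // stray continuation or invalid lead byte
        if (i + len > n) return false;  // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}

// ASSUMPTION: the locale encoding is ISO-8859-1. Each byte maps directly
// to the code point of the same value, so the conversion is trivial.
std::string latin1_to_utf8(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// The heuristic: keep the name if it is already valid UTF-8, otherwise
// treat it as being in the (assumed) locale encoding and convert it.
std::string filename_to_utf8(const std::string& raw) {
    return is_valid_utf8(raw) ? raw : latin1_to_utf8(raw);
}
```

With this, the raw bytes "caf\xE9" (Latin-1) fail the UTF-8 check and
get converted, while "caf\xC3\xA9" (already UTF-8) passes through
untouched. The guess can still be wrong: a short Latin-1 name may happen
to form valid UTF-8, though that is rare for real words.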

Heuristics make the world spin,

-- 
Ole Laursen
http://www.cs.aau.dk/~olau/


