Re: ustring::validate() costs?



On Thursday 01 December 2005 20:41, Matthias Kaeppler wrote:
> Hey guys,
>
> I am reading filenames from the harddisk which may or may not be in
> UTF-8 encoding. So, since Gtk+ and Glib naturally expect UTF-8, I
> somehow have to make sure my code doesn't break when the user's
> filenames are encoded differently.
>
> I spent quite some time searching the web and reading source code and
> documentation how to do this properly, and what I could figure out so
> far is this:
>
> If G_BROKEN_FILENAMES is set to 1 in the environment, then
> g_filename_to_utf8 will try to convert from the current locale to UTF-8,
> otherwise, the string is copied 1:1. For some reason this variable isn't
> mentioned in the documentation of the Glib character set conversion
> functions, so maybe my information is outdated--It only mentions
> G_FILENAME_ENCODING to determine the character set when you're
> converting /from/ UTF-8 to the locale's encoding.
>
> Anyway, I really don't want to force the user to set some obscure
> environment variable just so the program will work for him (since there
> are still users who do not use UTF-8 yet this is just not acceptable).
>
> So I thought I could do this:
> For every file I read, I first check if it's valid UTF-8 using
> ustring::validate(). If it isn't, I get the locale's character encoding
> with Glib::get_charset() and pass it to
> Glib::setenv("G_FILENAME_ENCODING", result_of_get_charset). Otherwise I
> set the env-variable to "" again. Bottom line, in any case the call to
> Glib::filename_to_utf8() will succeed (that's the intention at least).
>
> This way I can be sure that even files with mixed encodings (UTF-8 and
> non-UTF-8) are converted correctly, plus I don't need to force the user
> to supply these values.
>
> However, I'm concerned about runtime costs. How exactly does validate()
> work? How expensive is it to call on say 1000 files?

Are you worried about the codeset of a file's contents or of its filename?  
You begin by referring to filenames, but you appear to end by referring to 
the codeset in which a file has been written to.  All filenames in any one 
system will use the same codeset - you cannot have "files with mixed 
encodings", as you put it, in that sense.

If you are worried about a file's contents, I cannot comment on the time taken 
to run Glib::ustring::validate(), but if you do not need the result stored in 
a Glib::ustring object it would probably be faster to call the glib function 
g_utf8_validate() directly (which does not require constructing a 
Glib::ustring object to use), and check the input file line by line.  If all 
you want to do is to validate a filename then the call will take very little 
time and have no practical cost, whether you use g_utf8_validate() or 
Glib::ustring::validate().
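
To give a feel for why the cost is small, here is a minimal standalone sketch 
of a UTF-8 validity check.  It is not glib's implementation and the function 
name is_valid_utf8() is made up for illustration; it only assumes that 
validation is a single forward pass over the bytes with no allocation, which 
is the kind of work g_utf8_validate() has to do.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only (not glib's code):
// returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = p[i];
        if (c < 0x80) { ++i; continue; }     // plain ASCII byte

        std::size_t len;                     // total bytes in this sequence
        unsigned long cp;                    // code point being decoded
        unsigned long min;                   // smallest value allowed for len
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false;                   // stray continuation or bad lead

        if (n - i < len) return false;       // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = p[i + k];
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min) return false;                     // overlong encoding
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

Each byte is looked at once, so for 1000 filenames of a few dozen bytes each 
the whole job amounts to scanning some tens of kilobytes.  A Latin-1 name 
such as "caf\xE9" is rejected at the first non-ASCII byte, while the UTF-8 
form "caf\xC3\xA9" passes.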

On some other points, it is much better for your program to store data to 
file, and retrieve it, entirely in UTF-8.  It is only if you are obtaining 
data from files written to by other programs in the same system, which you 
know might be in the system's locale codeset and that might not comprise 
UTF-8, that you may have to consider codeset conversion.  (For files written 
by remote systems the codeset may be neither the local system's locale 
codeset nor UTF-8, so calling Glib::locale_to_utf8() would not necessarily 
do what you want in any case.)

In any event, calling Glib::get_charset() to obtain the local system's locale 
codeset seems pointless.  Glib::locale_to_utf8() will do this test for you 
and do nothing if the locale codeset happens to be UTF-8.

If all you want to do is to force a conversion of a filename from the locale 
codeset to UTF-8 and you don't want to bother with the G_BROKEN_FILENAMES or 
G_FILENAME_ENCODING environment variables, just use Glib::locale_to_utf8() 
(this will have the same effect as calling Glib::filename_to_utf8() with the 
G_BROKEN_FILENAMES environment variable set).  You lose the flexibility of 
being able to cater for the locale codeset and the filename codeset being 
different, but how many systems would do something as insane as that anyway?

The G_BROKEN_FILENAMES environment variable is accepted by all versions of 
glib-2.  Only at some later point in the glib-2 release cycle was the 
G_FILENAME_ENCODING variable adopted (it doesn't work with glib-2.0, but I do 
not know at what point between glib-2.0 and glib-2.8 it arrived).  The 
G_FILENAME_ENCODING environment variable works both ways - it determines 
the operation of both Glib::filename_to_utf8() and 
Glib::filename_from_utf8(), and as far as I am aware G_BROKEN_FILENAMES does 
the same.

Chris



