UString and locale conversion
- From: Jonathon Jongsma <jonathon quotidian org>
- To: nemiver list <nemiver-list gnome org>
- Subject: UString and locale conversion
- Date: Thu, 09 Apr 2009 23:52:29 -0500
So, I mentioned to Dodji on IRC today that I had uncovered a major performance issue with
the gdbmi parsing. The core of the problem was as follows:
LOG_D ("getting out at char '"
<< (char)a_input.c_str ()[cur]
<< "', at offset '"
<< (int)cur
<< "' for text >>>"
<< m_priv->input.raw()
<< "<<<",
GDBMI_PARSING_DOMAIN);
Now, this doesn't look especially suspicious, and in fact it looks like we may have even
added the input.raw() call because we thought that would *increase* performance, but it
happens to be a huge performance problem. What happens is this:
LogStream only defines operator<<() for const char* and UString, so there's no explicit
overload for std::string. So when we try to log a a std::string, it implicitly converts
it to a UString. So, we just converted a UString to a std::string by calling raw(), and
now we're converting it back. And if that wasn't bad enough, the UString(std::string)
constructor actually does an automatic locale conversion (i.e. it calls
Glib::locale_to_utf8()). Not only is that wrong (the std::string we're passing to the
constructor here is not in locale encoding, it's in utf-8 encoding), but it also kills
performance.
So after a brief discussion on IRC, we agreed that the UString(std::string) constructor
should be marked explicit so this sort of thing doesn't happen accidentally. However,
there is still the question of whether the Glib::UString(std::string) constructor should
actually do locale conversion. I was a bit ambivalent on this point when we talked about
it on IRC earlier, but after changing the constructor to 'explicit' and going through and
fixing all of the build failures this caused, I'm starting to come to the conclusion that
it's a bad idea.
for example, consider the following code:
bool ensure_buffer_is_in_utf8 (const UString &a_path,
const std::string &a_input,
UString &a_output,
std::string &a_current_charset)
{
LOG_FUNCTION_SCOPE_NORMAL_DD;
NEMIVER_TRY
UString buf_content;
if (is_buffer_valid_utf8 (a_input.c_str (), a_input.size ())) {
a_output = a_input; /// << this line no longer compiles
return true;
}
Notice the subtle bug that we uncovered by making the constructor explicit. Previously,
the line 'a_output = a_input;' was able to compile because presumably a_input was
implicitly converted from a std::string to a UString. But remember that constructor does
a locale conversion. So the logic of this code was essentially: "if the input is valid
utf8, convert it from the current locale to utf8". This is quite obviously wrong, but
notice that there's really no elegant way to fix this code. There's no direct way to
convert a std::string to the desired UString output type without doing a locale
conversion. You could do something like this, but it's rather ugly:
'a_output = UString(a_input.c_str());'
If somebody that didn't know the internals of the UString(std::string) and UString(const
char*) constructors looked at this code, they would wonder why we added the c_str() call
there, and it's quite possible that they would just remove that while doing some
refactoring, and introduce a bug accidentally.
So I'm currently leaning quite strongly toward the idea that doing a conversion here is a
bad idea. But at the same time, i think it would probably be a rather large effort to
change it at this point...
Dodji, any thoughts?
--
jonner
[
Date Prev][
Date Next] [
Thread Prev][
Thread Next]
[
Thread Index]
[
Date Index]
[
Author Index]