Re: UString and locale conversion
- From: Jonathon Jongsma <jonathon quotidian org>
- To: nemiver list <nemiver-list gnome org>
- Subject: Re: UString and locale conversion
- Date: Sat, 11 Apr 2009 22:26:56 -0500
Jonathon Jongsma wrote:
So, I mentioned to Dodji on IRC today that I had uncovered a major
performance issue with the gdbmi parsing. The core of the problem was
as follows:
LOG_D ("getting out at char '"
<< (char)a_input.c_str ()[cur]
<< "', at offset '"
<< (int)cur
<< "' for text >>>"
<< m_priv->input.raw()
<< "<<<",
GDBMI_PARSING_DOMAIN);
Now, this doesn't look especially suspicious, and in fact it looks like
we may have even added the input.raw() call because we thought that
would *increase* performance, but it happens to be a huge performance
problem. What happens is this:
LogStream only defines operator<<() for const char* and UString, so
there's no explicit overload for std::string. So when we try to log a a
std::string, it implicitly converts it to a UString. So, we just
converted a UString to a std::string by calling raw(), and now we're
converting it back. And if that wasn't bad enough, the
UString(std::string) constructor actually does an automatic locale
conversion (i.e. it calls Glib::locale_to_utf8()). Not only is that
wrong (the std::string we're passing to the constructor here is not in
locale encoding, it's in utf-8 encoding), but it also kills performance.
So after a brief discussion on IRC, we agreed that the
UString(std::string) constructor should be marked explicit so this sort
of thing doesn't happen accidentally. However, there is still the
question of whether the Glib::UString(std::string) constructor should
actually do locale conversion. I was a bit ambivalent on this point
when we talked about it on IRC earlier, but after changing the
constructor to 'explicit' and going through and fixing all of the build
failures this caused, I'm starting to come to the conclusion that it's a
bad idea.
for example, consider the following code:
bool ensure_buffer_is_in_utf8 (const UString &a_path,
const std::string &a_input,
UString &a_output,
std::string &a_current_charset)
{
LOG_FUNCTION_SCOPE_NORMAL_DD;
NEMIVER_TRY
UString buf_content;
if (is_buffer_valid_utf8 (a_input.c_str (), a_input.size ())) {
a_output = a_input; /// << this line no longer compiles
return true;
}
Notice the subtle bug that we uncovered by making the constructor
explicit. Previously, the line 'a_output = a_input;' was able to
compile because presumably a_input was implicitly converted from a
std::string to a UString. But remember that constructor does a locale
conversion. So the logic of this code was essentially: "if the input is
valid utf8, convert it from the current locale to utf8". This is quite
obviously wrong, but notice that there's really no elegant way to fix
this code. There's no direct way to convert a std::string to the
desired UString output type without doing a locale conversion. You
could do something like this, but it's rather ugly:
'a_output = UString(a_input.c_str());'
If somebody that didn't know the internals of the UString(std::string)
and UString(const char*) constructors looked at this code, they would
wonder why we added the c_str() call there, and it's quite possible that
they would just remove that while doing some refactoring, and introduce
a bug accidentally.
So I'm currently leaning quite strongly toward the idea that doing a
conversion here is a bad idea. But at the same time, i think it would
probably be a rather large effort to change it at this point...
Dodji, any thoughts?
OK, so dodji and I talked about this on IRC again a bit and I believe we both agreed on
removing the automatic conversion. So I have made this change and made the constructor
explicit, made the changes to LogStream mentioned above, and updated all code to compile
properly with these changes. Since the change is fairly invasive, I'd like to have some
review of it before pushing it to the repository, so I've put it up in my user repository
again:
git remote add jonner http://www.gnome.org/~jjongsma/git/nemiver.git/
git remote update
look at branch jonner/ustring-cleanup
This should speed up a lot of long parsing performance issues, so I'd like to get this in
soon. If you'd prefer I can just push it directly without review, but it seems safer to
have at least a little review.
--
jonner
[
Date Prev][
Date Next] [
Thread Prev][
Thread Next]
[
Thread Index]
[
Date Index]
[
Author Index]