Re: UString and locale conversion



Jonathon Jongsma wrote:
So, I mentioned to Dodji on IRC today that I had uncovered a major performance issue with the gdbmi parsing. The core of the problem was as follows:

        LOG_D ("getting out at char '"
               << (char)a_input.c_str ()[cur]
               << "', at offset '"
               << (int)cur
               << "' for text >>>"
               << m_priv->input.raw()
               << "<<<",
               GDBMI_PARSING_DOMAIN);

Now, this doesn't look especially suspicious, and in fact it looks like we may have even added the input.raw() call because we thought that would *increase* performance, but it happens to be a huge performance problem. What happens is this:

LogStream only defines operator<<() for const char* and UString, so there's no explicit overload for std::string. So when we try to log a a std::string, it implicitly converts it to a UString. So, we just converted a UString to a std::string by calling raw(), and now we're converting it back. And if that wasn't bad enough, the UString(std::string) constructor actually does an automatic locale conversion (i.e. it calls Glib::locale_to_utf8()). Not only is that wrong (the std::string we're passing to the constructor here is not in locale encoding, it's in utf-8 encoding), but it also kills performance.

So after a brief discussion on IRC, we agreed that the UString(std::string) constructor should be marked explicit so this sort of thing doesn't happen accidentally. However, there is still the question of whether the Glib::UString(std::string) constructor should actually do locale conversion. I was a bit ambivalent on this point when we talked about it on IRC earlier, but after changing the constructor to 'explicit' and going through and fixing all of the build failures this caused, I'm starting to come to the conclusion that it's a bad idea.

for example, consider the following code:
    bool ensure_buffer_is_in_utf8 (const UString &a_path,
                                   const std::string &a_input,
                                   UString &a_output,
                                   std::string &a_current_charset)
    {
        LOG_FUNCTION_SCOPE_NORMAL_DD;
        NEMIVER_TRY

        UString buf_content;
        if (is_buffer_valid_utf8 (a_input.c_str (), a_input.size ())) {
            a_output = a_input; /// << this line no longer compiles
            return true;
        }

Notice the subtle bug that we uncovered by making the constructor explicit. Previously, the line 'a_output = a_input;' was able to compile because presumably a_input was implicitly converted from a std::string to a UString. But remember that constructor does a locale conversion. So the logic of this code was essentially: "if the input is valid utf8, convert it from the current locale to utf8". This is quite obviously wrong, but notice that there's really no elegant way to fix this code. There's no direct way to convert a std::string to the desired UString output type without doing a locale conversion. You could do something like this, but it's rather ugly:
    'a_output = UString(a_input.c_str());'
If somebody that didn't know the internals of the UString(std::string) and UString(const char*) constructors looked at this code, they would wonder why we added the c_str() call there, and it's quite possible that they would just remove that while doing some refactoring, and introduce a bug accidentally.

So I'm currently leaning quite strongly toward the idea that doing a conversion here is a bad idea. But at the same time, i think it would probably be a rather large effort to change it at this point...

Dodji, any thoughts?


OK, so dodji and I talked about this on IRC again a bit and I believe we both agreed on removing the automatic conversion. So I have made this change and made the constructor explicit, made the changes to LogStream mentioned above, and updated all code to compile properly with these changes. Since the change is fairly invasive, I'd like to have some review of it before pushing it to the repository, so I've put it up in my user repository again:

git remote add jonner http://www.gnome.org/~jjongsma/git/nemiver.git/
git remote update
look at branch jonner/ustring-cleanup

This should speed up a lot of long parsing performance issues, so I'd like to get this in soon. If you'd prefer I can just push it directly without review, but it seems safer to have at least a little review.

--
jonner


[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]