Re: UString and locale conversion

From: Jonathon Jongsma <jonathon quotidian org>
To: nemiver list <nemiver-list gnome org>
Subject: Re: UString and locale conversion
Date: Sat, 11 Apr 2009 22:26:56 -0500

Jonathon Jongsma wrote:

So, I mentioned to Dodji on IRC today that I had uncovered a major performance issue with the gdbmi parsing. The core of the problem was as follows:
        LOG_D ("getting out at char '"
               << (char)a_input.c_str ()[cur]
               << "', at offset '"
               << (int)cur
               << "' for text >>>"
               << m_priv->input.raw()
               << "<<<",
               GDBMI_PARSING_DOMAIN);
Now, this doesn't look especially suspicious, and in fact it looks like we may have even added the input.raw() call because we thought that would *increase* performance, but it happens to be a huge performance problem. What happens is this:
LogStream only defines operator<<() for const char* and UString, so there's no explicit overload for std::string. So when we try to log a std::string, it implicitly converts it to a UString. So, we just converted a UString to a std::string by calling raw(), and now we're converting it back. And if that wasn't bad enough, the UString(std::string) constructor actually does an automatic locale conversion (i.e. it calls Glib::locale_to_utf8()). Not only is that wrong (the std::string we're passing to the constructor here is not in locale encoding, it's in utf-8 encoding), but it also kills performance.
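To make the round trip concrete, here is a minimal sketch. Text and Log are hypothetical stand-ins for UString and LogStream, not nemiver's actual classes, and a simple counter stands in for the Glib::locale_to_utf8() call, so this only shows where the hidden conversion happens, not what it costs.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for UString. The counter marks the spot where the
// real constructor would pay for a locale conversion.
struct Text {
    std::string bytes;
    static int conversions;

    Text(const char *s) : bytes(s) {}
    // Not explicit, so the compiler is free to call it silently.
    Text(const std::string &s) : bytes(s) { ++conversions; }

    const std::string &raw() const { return bytes; }
};
int Text::conversions = 0;

// Hypothetical stand-in for LogStream: overloads for const char* and Text only.
struct Log {
    std::string out;
    Log &operator<<(const char *s) { out += s; return *this; }
    Log &operator<<(const Text &t) { out += t.raw(); return *this; }
};

// Logging input.raw(): the std::string matches neither overload directly,
// so it is implicitly converted back into a Text.
int conversions_when_logging_raw() {
    Text input("^done");   // built through the const char* constructor
    Text::conversions = 0;
    Log log;
    log << ">>>" << input.raw() << "<<<";
    assert(log.out == ">>>^done<<<");
    return Text::conversions;
}

// Logging the Text itself: the Text overload is used as-is.
int conversions_when_logging_text() {
    Text input("^done");
    Text::conversions = 0;
    Log log;
    log << ">>>" << input << "<<<";
    assert(log.out == ">>>^done<<<");
    return Text::conversions;
}
```

Both functions log the same text, but only the raw() variant goes through the converting constructor.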
So after a brief discussion on IRC, we agreed that the UString(std::string) constructor should be marked explicit so this sort of thing doesn't happen accidentally. However, there is still the question of whether the Glib::UString(std::string) constructor should actually do locale conversion. I was a bit ambivalent on this point when we talked about it on IRC earlier, but after changing the constructor to 'explicit' and going through and fixing all of the build failures this caused, I'm starting to come to the conclusion that it's a bad idea.
For example, consider the following code:
    bool ensure_buffer_is_in_utf8 (const UString &a_path,
                                   const std::string &a_input,
                                   UString &a_output,
                                   std::string &a_current_charset)
    {
        LOG_FUNCTION_SCOPE_NORMAL_DD;
        NEMIVER_TRY

        UString buf_content;
        if (is_buffer_valid_utf8 (a_input.c_str (), a_input.size ())) {
            a_output = a_input; /// << this line no longer compiles
            return true;
        }
Notice the subtle bug that we uncovered by making the constructor explicit. Previously, the line 'a_output = a_input;' was able to compile because presumably a_input was implicitly converted from a std::string to a UString. But remember that constructor does a locale conversion. So the logic of this code was essentially: "if the input is valid utf8, convert it from the current locale to utf8". This is quite obviously wrong, but notice that there's really no elegant way to fix this code. There's no direct way to convert a std::string to the desired UString output type without doing a locale conversion. You could do something like this, but it's rather ugly:
    'a_output = UString(a_input.c_str());'
If somebody that didn't know the internals of the UString(std::string) and UString(const char*) constructors looked at this code, they would wonder why we added the c_str() call there, and it's quite possible that they would just remove that while doing some refactoring, and introduce a bug accidentally.
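As a rough illustration of what 'explicit' buys us, here is a sketch using a hypothetical Text type in place of UString (none of this is nemiver code). It assumes a constructor that simply copies the bytes, with no locale conversion. The static_asserts spell out which conversions the compiler will and will not perform on its own.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for UString with an explicit std::string constructor
// that copies the bytes unchanged (assumption: no locale conversion).
struct Text {
    std::string bytes;

    Text() = default;
    Text(const char *s) : bytes(s) {}
    explicit Text(const std::string &s) : bytes(s) {}
};

// A std::string can no longer turn into a Text behind our back...
static_assert(!std::is_convertible<std::string, Text>::value,
              "implicit std::string -> Text conversion must be rejected");
// ...so an assignment like 'a_output = a_input;' stops compiling...
static_assert(!std::is_assignable<Text &, const std::string &>::value,
              "assigning a std::string to a Text must be rejected");
// ...while spelling the conversion out is still allowed.
static_assert(std::is_constructible<Text, const std::string &>::value,
              "explicit construction from std::string must still work");

// The conversion is now visible at the call site, with no c_str() detour.
Text copy_valid_utf8(const std::string &input) {
    return Text(input);
}
```

With this shape the caller writes Text(input) and the conversion shows up in the source, so nobody has to guess why a c_str() call is there.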
So I'm currently leaning quite strongly toward the idea that doing a conversion here is a bad idea. But at the same time, I think it would probably be a rather large effort to change it at this point...
Dodji, any thoughts?

OK, so Dodji and I talked about this on IRC again a bit and I believe we both agreed on removing the automatic conversion. So I have made this change and made the constructor explicit, made the changes to LogStream mentioned above, and updated all code to compile properly with these changes. Since the change is fairly invasive, I'd like to have some review of it before pushing it to the repository, so I've put it up in my user repository again:


git remote add jonner http://www.gnome.org/~jjongsma/git/nemiver.git/
git remote update
look at branch jonner/ustring-cleanup

This should fix a lot of the performance issues when parsing long input, so I'd like to get this in soon. If you'd prefer I can just push it directly without review, but it seems safer to have at least a little review.


--
jonner

Follow-Ups:
- Re: UString and locale conversion
  - From: Dodji Seketeli

References:
- UString and locale conversion
  - From: Jonathon Jongsma
