Re: Glib::ustring's operator<< doing a conversion to locale, why?



Hey Murray,

Well, the thing is that UTF-8 remains C-string compatible: there are no zero terminators in the middle of the string, so all you have is a sequence of bytes and a terminating zero. That can always be stuffed into an std::string (not a wstring, btw; and this is a separate problem from "C++ doesn't provide any kind of UTF-8 string data type"), and you get the same sequence back out.

Since you can store a C string in an std::string and get the exact same bytes back out using std::string::c_str(), you can store a UTF-8 string unchanged in an std::string (or pipe it into an ostream, etc.).
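To make the round trip concrete, here is a minimal sketch using only the standard library. The names kUtf8Sample, round_trip and check_round_trip are made up for illustration; nothing here is glibmm API.

```cpp
#include <cassert>
#include <cstring>
#include <sstream>
#include <string>

// "Grüße" as raw UTF-8 bytes. The literal is split so that the trailing 'e'
// is not swallowed by the preceding hex escape.
static const char kUtf8Sample[] = "Gr\xC3\xBC\xC3\x9F" "e";

// Hypothetical helper: copy the bytes into a std::string, insert that into a
// narrow ostream, and return whatever the stream ended up holding.
std::string round_trip(const char* utf8)
{
    std::string s(utf8);      // copies everything up to the terminating zero
    std::ostringstream out;
    out << s;                 // narrow stream: the chars are inserted as-is
    return out.str();
}

// Checks that neither std::string nor the stream touched the bytes.
void check_round_trip()
{
    const std::string s(kUtf8Sample);
    assert(std::strcmp(s.c_str(), kUtf8Sample) == 0);
    assert(round_trip(kUtf8Sample) == s);
}
```

Calling check_round_trip() should pass silently; the 7 bytes go in and the same 7 bytes come out.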

This conversion that ustring performs seems, for what my poor C++ knowledge is worth, superfluous, and it is causing big problems for us at the moment. I've converted all of the relevant code to use a simple wrapped printf instead of boost::format so that the data goes through unchanged, and I'm now using ustring::c_str() everywhere. But this clearly can't be the real answer, since boost is not exactly uncommon among C++ developers and boost::format is actually a very nice class.

The only thing that doesn't work with UTF-8 in an std::string is accessing characters by index (as opposed to bytes by index), since a UTF-8 character can span multiple bytes. To avoid confusion: this is not the "same" multibyte as e.g. the wide characters of std::wstring; it just means "one character can encompass multiple bytes". If you don't need to access individual characters, you can just as well use an std::string.
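A small sketch of the byte-versus-character difference, again with the standard library only. utf8_length and check_utf8_length are hypothetical names, and the counting assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count characters (code points) by skipping UTF-8
// continuation bytes, which all have the bit pattern 10xxxxxx.
// Assumes the input is valid UTF-8; no validation is done.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// "Grüße": 5 characters stored in 7 bytes.
void check_utf8_length()
{
    const std::string s = "Gr\xC3\xBC\xC3\x9F" "e";
    assert(s.size() == 7);          // size() counts bytes
    assert(utf8_length(s) == 5);    // characters
    // s[2] is only the first byte of the two-byte 'ü', not a character.
    assert(static_cast<unsigned char>(s[2]) == 0xC3);
}
```

So indexing gives you bytes, but storing, copying and printing the whole string is unaffected.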

With regard to this, I believe that what the following section of the documentation states should also apply to operators << and >>: "Glib::ustring has implicit type conversions to and from std::string. These conversions do not convert to/from the current locale". (Even worse, ustring::operator>> assumes that the input data is in the locale encoding, which may or may not be the case, for whatever reason.)

Also, as documented here: http://www.cplusplus.com/reference/iostream/ios/imbue.html , an ostream can be imbued with a locale (or already has one from the start, presumably the system locale; I'm speaking hypothetically because I just don't know for certain), and would then perform any necessary conversion itself.
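For what it's worth, here is a sketch of what imbue() does on a narrow stream, as far as I understand it: the locale's facets drive things like number formatting, while char data inserted with << is passed through byte for byte. comma_decimal and format_with_comma_locale are names made up for this example; the custom facet is used so the sketch doesn't depend on which named locales are installed.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: use ',' as the decimal point.
struct comma_decimal : std::numpunct<char> {
    char do_decimal_point() const override { return ','; }
};

// Imbue a narrow string stream with a locale carrying the facet above, then
// insert a number and some UTF-8 bytes ("Grüße").
std::string format_with_comma_locale()
{
    std::ostringstream out;
    // The locale takes ownership of the facet.
    out.imbue(std::locale(std::locale::classic(), new comma_decimal));
    out << 1.5 << ' ' << "Gr\xC3\xBC\xC3\x9F" "e";
    return out.str();
}

// The number follows the imbued locale; the UTF-8 bytes are left alone.
void check_imbue()
{
    assert(format_with_comma_locale() == "1,5 Gr\xC3\xBC\xC3\x9F" "e");
}
```

This only covers an in-memory stream; I haven't checked what every stream buffer does, so take it as a sketch rather than a statement about ostream in general.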

Can someone please state definitively that this conversion that ustring performs is _really_ necessary (because I fail to see why it is), if only so that I can rest easy with having to use ustring::c_str() all the time :P

Thanks
Milosz

-----------------------------
"'Cause if an actor acts in the forest and there's nobody there to see him...... Or something like that." -- Terry O'Ryan

On 5/4/07, Murray Cumming <murrayc murrayc com> wrote:
On Thu, 2007-05-03 at 23:54 +0200, Milosz Derezynski wrote:
> I found that Glib::ustring::operator<<() does a locale-from-utf8
> conversion, always, and all the time. This totally, erm, avoiding foul
> language, spoils up usage of e.g. boost::format properly, if i want to
> have the UTF8-ness preserved.

Yes, this is documented here:
http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html#_details

I believe that this must be done because the C++ ostream cannot deal
with UTF-8. I don't know whether boost::format can deal with UTF-8 -
possibly not.

People have complained often before, but the problem seems to remain in
C++ ostream itself.

> Using .c_str() or .raw() will avoid this problem, but this is hardly a
> solution (now i'd have to audit _all_ of our code). Adding to this
> comes that an ostream will probably perform a conversion to whatever
> locale it's imbued with anyway, so why this conversion there?
>
> It seems to be just simply and plain flawed to me, unless i'm totally
> wrong, in which case i'd be glad to accept a justification of the
> issue.
>
> -- Milosz
> "'Cause if an actor acts in the forest and there's nobody there to see
> him...... Or something like that." -- Terry O'Ryan

--
Murray Cumming
murrayc murrayc com
www.murrayc.com
www.openismus.com



