Re: Glib::ustring's operator<< doing a conversion to locale, why?



On Fri, 2007-05-04 at 12:26 +0200, Milosz Derezynski wrote:
> Hey Murray,
> 
> Well, the thing is that with UTF-8 you basically remain C string
> compatible (there are no zero terminators in the middle of the
> string), which essentially means that you only have a sequence of
> bytes and a terminating zero. That sequence can always be
> stuffed into a std::string (not a wstring, btw, and this is not the same
> problem as "C++ doesn't provide any kind of UTF-8 string data type"),
> and you get the same sequence back out.
> 
> Since you can store a C string in a std::string, and get the exact
> same result back out using std::string::c_str(), you can store a
> UTF-8 string unchanged in a std::string (or pipe it into an ostream,
> etc).

I guess that ostream has different constraints than std::string. For
instance, ostream can convert numbers to text representations and
vice-versa and do formatting. It can't do that for UTF-8 strings,
because there is no UTF-8 support in standard C++.

I don't know whether this is the main problem with ostream and UTF-8.

If you are sure that the use of ostream should work then I guess you can
just give the result of ustring::raw() to the ostream, as suggested
in the documentation.

> This conversion (that ustring performs) seems, for what my poor
> C++ knowledge is worth, superfluous, and it currently causes big problems
> for us. I've converted all of the relevant code to use a simple
> wrapped printf instead of boost::format, so the data goes through unchanged,
> and I'm using ustring::c_str() everywhere now, but this clearly can't
> be the answer, since boost is not exactly uncommon among C++ developers and
> boost::format is actually a very nice class.
> 
> The only thing that doesn't work with UTF-8 in a std::string is
> accessing characters by index (as opposed to bytes by index), since UTF-8
> characters can be multibyte. To avoid confusion,
> this is not the "same" multibyte that e.g. std::wstring is; it just
> means "one character can encompass multiple bytes". If you don't
> need to access individual characters, you could just as well use a
> std::string.
> 
> On this point, I believe that what the following section states should
> also apply to operators << and >> (even worse, ustring::operator>>
> assumes that the input data is in the locale encoding, which may
> simply not be the case for whatever reason): "Glib::ustring has
> implicit type conversions to and from std::string. These conversions
> do not convert to/from the current locale"
> 
> Also, as documented here:
> http://www.cplusplus.com/reference/iostream/ios/imbue.html , an
> ostream can be imbued with a locale (or already has one from the start,
> presumably the system locale; speaking hypothetically because I just
> don't know for certain), and performs any necessary conversion itself.
> 
> Can someone please state definitively that this conversion that ustring
> performs is _really_ necessary (because I fail to see why it is), if
> only so I can rest easy with having to use ustring::c_str()
> all the time :P
> 
> Thanks
> Milosz
> 
> -----------------------------
> "'Cause if an actor acts in the forest and there's nobody there to see
> him...... Or something like that." -- Terry O'Ryan
> 
> On 5/4/07, Murray Cumming <murrayc murrayc com> wrote:
>         On Thu, 2007-05-03 at 23:54 +0200, Milosz Derezynski wrote:
>         > I found that Glib::ustring::operator<<() does a locale-from-utf8
>         > conversion, always, and all the time. This totally, erm, avoiding foul
>         > language, spoils up usage of e.g. boost::format properly, if i want to
>         > have the UTF8-ness preserved.
>         
>         Yes, this is documented here:
>         http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html#_details
>         
>         I believe that this must be done because the C++ ostream cannot deal
>         with UTF-8. I don't know whether boost::format can deal with UTF-8 -
>         possibly not.
>         
>         People have complained often before, but the problem seems to remain in
>         C++ ostream itself.
>         
>         > Using .c_str() or .raw() will avoid this problem, but this is hardly a
>         > solution (now i'd have to audit _all_ of our code). Adding to this
>         > comes that an ostream will probably perform a conversion to whatever
>         > locale it's imbued with anyway, so why this conversion there?
>         >
>         > It seems to be just simply and plain flawed to me, unless i'm totally
>         > wrong, in which case i'd be glad to accept a justification of the
>         > issue.
>         >
>         > -- Milosz
>         > "'Cause if an actor acts in the forest and there's nobody there to see
>         > him...... Or something like that." -- Terry O'Ryan
>         
>         --
>         Murray Cumming
>         murrayc murrayc com
>         www.murrayc.com
>         www.openismus.com
>         
> 
-- 
Murray Cumming
murrayc murrayc com
www.murrayc.com
www.openismus.com



