Re: Glib::ustring's operator<< doing a conversion to locale, why?
- From: Murray Cumming <murrayc murrayc com>
- To: Milosz Derezynski <internalerror gmail com>
- Cc: Gtkmm Mailing List <gtkmm-list gnome org>
- Date: Fri, 04 May 2007 12:41:13 +0200
On Fri, 2007-05-04 at 12:26 +0200, Milosz Derezynski wrote:
> Hey Murray,
>
> Well the thing is that with UTF-8, you basically remain C string
> compatible (there are no zero/terminators in the middle of the
> string), which essentially means that you only have a sequence of
> bytes and a terminating zero, which can always and at any time be
> stuffed into an std::string (not wstring btw, and this is not the same
> problem as "C++ doesn't provide any kind of UTF-8 string data type"),
> and get the same sequence back out.
>
> Since you can store a C string into an std::string, and get the exact
> same result back out using std::string::c_str(), you can store an
> UTF-8 string unchanged into an std::string (or pipe into an ostream,
> etc).
I guess that ostream has different constraints than std::string. For
instance, ostream can convert numbers to text representations and
vice-versa and do formatting. It can't do that for UTF-8 strings,
because there is no UTF-8 support in standard C++.
I don't know whether this is the main problem with ostream and UTF-8.
If you are sure that the use of ostream should work then I guess you can
just give the result of ustring::raw() to the ostream, as suggested
in the documentation.
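A minimal sketch of that suggestion, using only the standard library. A plain std::string stands in for what ustring::raw() would return, and write_raw_bytes and check_raw_bytes_survive are hypothetical names made up for this example. As far as I can tell, operator<< on a narrow stream copies the bytes of a std::string as they are, so the UTF-8 sequence should come back unchanged:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Writes the bytes to a narrow stream and returns what the stream holds.
// With glibmm, the argument would be the result of Glib::ustring::raw();
// a std::string stands in here so the sketch has no glibmm dependency.
std::string write_raw_bytes(const std::string& utf8_bytes) {
    std::ostringstream out;
    out << utf8_bytes;  // chars are copied as-is, no locale conversion
    return out.str();
}

// Usage sketch: "grüß" spelled out as UTF-8 bytes comes back unchanged.
void check_raw_bytes_survive() {
    const std::string word = "gr\xC3\xBC\xC3\x9F";
    assert(write_raw_bytes(word) == word);
}
```

This only shows that the bytes pass through the stream untouched. Whether a terminal or file reader then interprets them as UTF-8 is a separate question.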
> This conversion (that ustring performs) seems to be, for what my poor
> C++ knowledge is worth it, superfluous, and produces big problems atm
> for us. I've converted all of the relevant code to use a simple
> wrapped printf instead of boost::format, so the data goes unchanged,
> and i'm using ustring::c_str() now everywhere, but this clearly can't
> be it since boost is not exactly uncommon among C++ developers and
> boost::format is actually a very nice class.
>
> The only thing that doesn't work with UTF-8 and an std::string is
> accessing characters per index (and not bytes per index), since UTF-8
> characters can be multibyte, but again as not to confuse things up,
> this is not the "same" multibyte that e.g. std::wstring is, it just
> means "one character can encompass multiple bytes", but if you don't
> need to access individual characters, you could just as well use an
> std::string.
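To illustrate the bytes-versus-characters point above, here is a small sketch in standard C++. It assumes the input is well-formed UTF-8 and does no validation; count_code_points and check_bytes_versus_characters are hypothetical names, not part of glibmm or the standard library:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes, which
// always have the bit pattern 10xxxxxx. Assumes well-formed input.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Usage sketch: "grüß" occupies six bytes but holds four characters,
// so std::string::size() and operator[] work on bytes, not characters.
void check_bytes_versus_characters() {
    const std::string word = "gr\xC3\xBC\xC3\x9F";
    assert(word.size() == 6);
    assert(count_code_points(word) == 4);
}
```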
>
> Wrt this, i believe that same as the following section states should
> go for operators << and >> (and even worse, ustring::operator>>
> assumes that the input data is in locale, which could be, or simply
> could just not be the case for whatever reasons): " Glib::ustring has
> implicit type conversions to and from std::string. These conversions
> do not convert to/from the current locale"
>
> Also, as documented here:
> http://www.cplusplus.com/reference/iostream/ios/imbue.html , an
> ostream can be imbued with a locale (or already is from the start,
> presumably the system locale; speaking hypothetically because i just
> don't know for certain), and performs a necessary conversion itself.
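A hedged sketch of what imbue() does on a narrow stream, as far as I understand it: the imbued locale changes how numbers are formatted, but the bytes of a std::string are still written as they are. DottedThousands and format_with_imbued_locale are hypothetical names for this example; a custom facet is used so that the sketch does not depend on which locales are installed:

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: groups digits in threes, separated by '.'.
struct DottedThousands : std::numpunct<char> {
    char do_thousands_sep() const override { return '.'; }
    std::string do_grouping() const override { return "\3"; }
};

// Formats a number and a string through a stream with an imbued locale.
std::string format_with_imbued_locale(long number, const std::string& utf8) {
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it later.
    out.imbue(std::locale(std::locale::classic(), new DottedThousands));
    out << number << ' ' << utf8;
    return out.str();
}

// Usage sketch: the number is regrouped, the UTF-8 bytes are untouched.
void check_imbue_affects_numbers_only() {
    const std::string word = "gr\xC3\xBC\xC3\x9F";  // "grüß" in UTF-8
    assert(format_with_imbued_locale(1234567, word) == "1.234.567 " + word);
}
```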
>
> Can someone please definitely state that this conversion that ustring
> performs is _really_ necessary (because i fail to see why it is); and
> may it be only so i can rest well with having to use ustring::c_str()
> all the time :P
>
> Thanks
> Milosz
>
> -----------------------------
> "'Cause if an actor acts in the forest and there's nobody there to see
> him...... Or something like that." -- Terry O'Ryan
>
> On 5/4/07, Murray Cumming <murrayc murrayc com> wrote:
> On Thu, 2007-05-03 at 23:54 +0200, Milosz Derezynski wrote:
> > I found that Glib::ustring::operator<<() does a locale-from-utf8
> > conversion, always, and all the time. This totally, erm, avoiding foul
> > language, spoils up usage of e.g. boost::format properly, if i want to
> > have the UTF8-ness preserved.
>
> Yes, this is documented here:
> http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html#_details
>
> I believe that this must be done because the C++ ostream cannot deal
> with UTF-8. I don't know whether boost::format can deal with UTF-8 -
> possibly not.
>
> People have complained often before, but the problem seems to remain in
> C++ ostream itself.
>
> > Using .c_str() or .raw() will avoid this problem, but this is hardly a
> > solution (now i'd have to audit _all_ of our code). Adding to this
> > comes that an ostream will probably perform a conversion to whatever
> > locale it's imbued with anyway, so why this conversion there?
> >
> > It seems to be just simply and plain flawed to me, unless i'm totally
> > wrong, in which case i'd be glad to accept a justification of the
> > issue.
> >
> > -- Milosz
> > "'Cause if an actor acts in the forest and there's nobody there to see
> > him...... Or something like that." -- Terry O'Ryan
>
> --
> Murray Cumming
> murrayc murrayc com
> www.murrayc.com
> www.openismus.com
>
>
--
Murray Cumming
murrayc murrayc com
www.murrayc.com
www.openismus.com