Re: Glib::ustring's operator<< doing a conversion to locale, why?



Sorry for dipping into this again (I know I've already basically explained my point), but all I meant is that, regardless of what Daniel's reasons were for implementing the operators like this, they behave differently from std::string's operator<< and operator>>, and at first glance there seems to be no reason for that.

Basically, without explaining too much again, what it comes down to is that it is a specific conversion from UTF-8 to the _locale_ encoding -- and not to the default locale each program starts with (which I believe is even per the standard the "C" locale), but to the current locale as set by LANG or whichever higher-ranking qualifier applies (LC_ALL, LC_CTYPE, etc.).

ustring has no reason to assume that people want this data converted to the current locale on output, nor should it assume that all input is in the locale encoding (with operator>>()). The latter is actually not the real problem in our code, but I see it as generally being the bigger case in terms of faultiness.

Instead, a documentation comment should state that one needs to convert to and from UTF-8 in the relevant way when needed.
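(For illustration, a minimal sketch of what that explicit style could look like, using the existing Glib::locale_from_utf8()/Glib::locale_to_utf8() helpers from glibmm's convert.h; the literal text and the setlocale() call are just assumptions for the example:)

#include <glibmm.h>
#include <clocale>
#include <iostream>
#include <string>

int main()
{
  // Pick up the locale from the environment (LANG, LC_ALL, ...), as a
  // real application (or gtk_init()) would normally do.
  std::setlocale(LC_ALL, "");

  const Glib::ustring utf8_text("Gr\xc3\xbc\xc3\x9f" "e"); // "Grüße" as explicit UTF-8 bytes

  try
  {
    // Output: convert explicitly from UTF-8 to the locale encoding.
    std::cout << Glib::locale_from_utf8(utf8_text) << std::endl;

    // Input: convert explicitly from the locale encoding back to UTF-8.
    std::string line;
    if (std::getline(std::cin, line))
    {
      const Glib::ustring as_utf8 = Glib::locale_to_utf8(line);
      // ... work with as_utf8 ...
    }
  }
  catch (const Glib::ConvertError&)
  {
    std::cerr << "could not convert between UTF-8 and the locale encoding" << std::endl;
  }

  return 0;
}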

Clearly this would raise the bar a little for probably "most cases" of using ustring, and it would require some C++ skills with respect to the locale system, and to streams and code conversions in general, but I am more and more convinced now that it would be the more correct thing to do.

Yes, this is basically a plea for rethinking the current implementation and removing the conversions. I know it would be a total API break within this part of the API, but still.

How many apps actually "rely" on this behaviour, and if they do, why?

Milosz

On 5/4/07, Murray Cumming <murrayc murrayc com> wrote:
On Fri, 2007-05-04 at 21:50 +0100, Chris Vine wrote:
> On Fri, 2007-05-04 at 12:41 +0200, Murray Cumming wrote:
> > On Fri, 2007-05-04 at 12:26 +0200, Milosz Derezynski wrote:
> > > Hey Murray,
> > >
> > > Well the thing is that with UTF-8, you basically remain C string
> > > compatible (there are no zero/terminators in the middle of the
> > > string), which essentially means that you only have a sequence of
> > > bytes and a terminating zero, which can always and at any time be
> > > stuffed into an std::string (not wstring btw, and this is not the same
> > > problem as "C++ doesn't provide any kind of UTF-8 string data type"),
> > > and get the same sequence back out.
> > >
> > > Since you can store a C string into an std::string, and get the exact
> > > same result back out using std::string::c_str(), you can store an
> > > UTF-8 string unchanged into an std::string (or pipe into an ostream,
> > > etc).
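(For illustration, a small sketch of that round trip; the byte values are just an example, spelled out as explicit escapes so nothing depends on the source file's encoding:)

#include <cassert>
#include <cstring>
#include <string>

int main()
{
  // "été" as explicit UTF-8 bytes: no zero bytes in the middle, just a
  // byte sequence plus the terminating zero.
  const char* utf8 = "\xc3\xa9t\xc3\xa9";

  // Store it in a std::string ...
  const std::string stored(utf8);

  // ... and the exact same byte sequence comes back out via c_str().
  assert(std::strcmp(stored.c_str(), utf8) == 0);
  assert(stored.size() == 5); // 5 bytes, although only 3 characters

  return 0;
}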
> >
> > I guess that ostream has different constraints than std::string. For
> > instance, ostream can convert numbers to text representations and
> > vice-versa and do formatting. It can't do that for UTF-8 strings,
> > because there is no UTF-8 support in standard C++.
> >
> > I don't know whether this is the main problem with ostream and UTF-8.
> >
> > If you are sure that the use of ostream should work, then I guess you can
> > just give the result of ustring::raw() to the ostream, as suggested
> > in the documentation.
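(For illustration, a sketch of that suggestion; the string contents are just an example:)

#include <glibmm.h>
#include <iostream>

int main()
{
  const Glib::ustring name("\xc3\xa9t\xc3\xa9"); // "été" in UTF-8

  // operator<<(std::ostream&, const Glib::ustring&) would convert to the
  // current locale's encoding; raw() hands the untouched UTF-8 bytes to
  // the stream instead.
  std::cout << name.raw() << std::endl;

  return 0;
}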
>
> The use of ustring::raw() to suppress codeset conversion by the
> insertion and extraction operators will definitely work.  In the absence
> of the codeset facet being set, the only conversion the ostream is
> entitled to make (and then only if the binary flag is not set) is of the
> end of line marker ('\n'), which is not UTF-8 dependent.
>
> The only problem with using ostreams with UTF-8 is in relation to field
> width output formatting, which will be set in bytes rather than
> characters with a standard ostream imbued with the C locale.  You
> mention numbers, but they are not a problem as in ASCII/UTF-8 they have
> a single byte representation.
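(For illustration, a sketch of the field width point; with a plain char-based stream, setw() counts bytes rather than characters:)

#include <iomanip>
#include <iostream>

int main()
{
  // "été" as explicit UTF-8 bytes: 5 bytes, but only 3 characters.
  const char* utf8 = "\xc3\xa9t\xc3\xa9";

  // The field width is measured in char units (bytes), so only one
  // space of padding is added here, even though just 3 glyphs print.
  std::cout << '[' << std::setw(6) << utf8 << ']' << std::endl;

  return 0;
}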

English numbers maybe. Are you sure that no language uses more than one
byte in UTF-8 for any of its numbers or for decimal points or commas?

But again, I don't think this is the main issue. There's some more
obvious error that this is meant to prevent. I'm not sure what it is.

> Codeset conversion should really be left to codeset facets rather than
> the extraction and insertion operators, which are ignorant of the state
> of the binary flag and whatever locales and facets have been imbued into
> the stream.  They ignore one of the features of C++ streams which
> differentiate them from C streams, which is that different stream
> objects can have different locales imbued in them.
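(For illustration, a sketch of that facet-based approach; the locale names are assumptions and must actually be installed on the system, otherwise the std::locale constructor throws:)

#include <fstream>
#include <locale>

int main()
{
  // Two streams, two different imbued locales: each stream's codecvt
  // facet performs the external-encoding conversion when the wide
  // characters are written out.
  std::wofstream utf8_out;
  utf8_out.imbue(std::locale("en_US.UTF-8"));        // assumed installed
  utf8_out.open("utf8.txt");

  std::wofstream latin1_out;
  latin1_out.imbue(std::locale("en_US.ISO-8859-1")); // assumed installed
  latin1_out.open("latin1.txt");

  const wchar_t* text = L"\u00e9t\u00e9"; // "été"

  utf8_out << text;   // written as UTF-8 bytes
  latin1_out << text; // written as ISO-8859-1 bytes

  return 0;
}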
>
> I think I remember someone (Daniel Elstner?) mentioning that they seemed
> like a good idea at the time but turned out to be a mistake which it is
> too late to change.

He's always seemed very sure about this being correct, I think.

--
Murray Cumming
murrayc murrayc com
www.murrayc.com
www.openismus.com



