Re: Glib::ustring's operator<< doing a conversion to locale, why?



A small follow-up to the previous mail regarding "I see no reason": An std::string can hold text in an encoding other than that of the current locale as well. ustring assumes that it always holds UTF-8 (and I think it cannot even sanely hold anything else, save for valid subsets of UTF-8), so performing a conversion in the operators is rational in itself. The part that bugs me is that it uses LANG or LC_* or LOCALE (etc., as stated) as the basis for the conversion, and there is no reason to believe that people actually always want this.

If it used the current global C++ locale (if there is such a thing as a global locale setting; sorry for my newbishness again), that would be all right really, but as it stands it's just beyond odd.
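
A minimal sketch of the byte-for-byte round trip discussed in this thread, written in plain standard C++ so that it does not depend on glibmm. The helper name through_ostream is made up for illustration. With a Glib::ustring one would presumably pass the result of ustring::raw(), as suggested in the quoted mails below.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of glibmm): pushes the bytes of `utf8`
// through an ostream and returns what the stream received.
std::string through_ostream(const std::string& utf8)
{
    std::ostringstream out;
    // Inserting a std::string copies its bytes as they are;
    // no codeset conversion is applied by this operator.
    out << utf8;
    return out.str();
}
```

Called with "Mi\xC5\x82osz" (seven bytes, the middle two being the UTF-8 encoding of U+0142), the helper should return the same seven bytes unchanged.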

Daniel, can you maybe shed some light on this, please?

Milosz

On 5/4/07, Murray Cumming <murrayc murrayc com> wrote:
On Fri, 2007-05-04 at 21:50 +0100, Chris Vine wrote:
> On Fri, 2007-05-04 at 12:41 +0200, Murray Cumming wrote:
> > On Fri, 2007-05-04 at 12:26 +0200, Milosz Derezynski wrote:
> > > Hey Murray,
> > >
> > > Well the thing is that with UTF-8, you basically remain C string
> > > compatible (there are no zero/terminators in the middle of the
> > > string), which essentially means that you only have a sequence of
> > > bytes and a terminating zero, which can always and at any time be
> > > stuffed into an std::string (not wstring btw, and this is not the same
> > > problem as "C++ doesn't provide any kind of UTF-8 string data type"),
> > > and get the same sequence back out.
> > >
> > > Since you can store a C string into an std::string, and get the exact
> > > same result back out using std::string::c_str(), you can store a
> > > UTF-8 string unchanged into an std::string (or pipe into an ostream,
> > > etc).
> >
> > I guess that ostream has different constraints than std::string. For
> > instance, ostream can convert numbers to text representations and
> > vice-versa and do formatting. It can't do that for UTF-8 strings,
> > because there is no UTF-8 support in standard C++.
> >
> > I don't know whether this is the main problem with ostream and UTF-8.
> >
> > If you are sure that the use of ostream should work then I guess you can
> > just give the result of ustring::raw() to the ostream, as suggested
> > in the documentation.
>
> The use of ustring::raw() to suppress codeset conversion by the
> insertion and extraction operators will definitely work.  In the absence
> of the codeset facet being set, the only conversion the ostream is
> entitled to make (and then only if the binary flag is not set) is that of
> the end of line marker ('\n'), which is not UTF-8 dependent.
>
> The only problem with using ostreams with UTF-8 is in relation to field
> width output formatting, which will be set in bytes rather than
> characters with a standard ostream imbued with the C locale.  You
> mention numbers, but they are not a problem as in ASCII/UTF-8 they have
> a single byte representation.

English numbers maybe. Are you sure that no language uses more than one
byte in UTF-8 for any of its numbers or for decimal points or commas?

But again, I don't think this is the main issue. There's some more
obvious error that this is meant to prevent. I'm not sure what it is.
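
A small sketch of the field-width point raised above, under the assumption of a narrow ostream imbued with the classic "C" locale. The helper names pad_right_aligned and format_number are invented for this example. It shows that std::setw pads by bytes, and that integer insertion on such a stream yields ASCII digits, while a digit from another script (here ARABIC-INDIC DIGIT THREE, U+0663) takes two bytes in UTF-8.

```cpp
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: right-aligns `text` in a field of `width`,
// using a stream imbued with the classic "C" locale.
std::string pad_right_aligned(const std::string& text, int width)
{
    std::ostringstream out;
    out.imbue(std::locale::classic());
    out << std::setw(width) << text;  // width is measured in bytes
    return out.str();
}

// Hypothetical helper: formats an integer with the classic locale.
std::string format_number(int n)
{
    std::ostringstream out;
    out.imbue(std::locale::classic());
    out << n;  // digits come out as single-byte ASCII
    return out.str();
}

// UTF-8 encoding of U+0663; not something the stream above produces.
const std::string arabic_indic_three = "\xD9\xA3";
```

With a width of 6, "cafe" (four bytes) gets two padding spaces, but "caf\xC3\xA9" (five bytes, four characters) gets only one, so the two fields no longer line up on a terminal.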

> Codeset conversion should really be left to codeset facets rather than
> the extraction and insertion operators, which are ignorant of the state
> of the binary flag and whatever locales and facets have been imbued into
> the stream.  They ignore one of the features of C++ streams which
> differentiate them from C streams, which is that different stream
> objects can have different locales imbued in them.
>
> I think I remember someone (Daniel Elstner?) mentioning that they seemed
> like a good idea at the time but turned out to be a mistake which it is
> too late to change.

He's always seemed very sure about this being correct, I think.
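
To illustrate the remark above that different stream objects can have different locales imbued in them, here is a rough sketch in standard C++. The facet comma_decimal and the helper functions are made up for this example. A custom numpunct facet is used instead of a named system locale so that the sketch does not depend on which locales happen to be installed.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: uses ',' instead of '.' as the decimal point.
struct comma_decimal : std::numpunct<char>
{
protected:
    char do_decimal_point() const override { return ','; }
};

// Formats `value` on a stream imbued with `loc` only; the global
// locale and other streams are not affected.
std::string format_with(const std::locale& loc, double value)
{
    std::ostringstream out;
    out.imbue(loc);
    out << value;
    return out.str();
}

std::string classic_format(double value)
{
    return format_with(std::locale::classic(), value);
}

std::string comma_format(double value)
{
    // The new locale takes ownership of the facet.
    return format_with(std::locale(std::locale::classic(), new comma_decimal), value);
}
```

The same value then comes out differently depending on the stream it is written to, which is exactly the per-stream state that an insertion operator consulting only the process environment cannot see.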

--
Murray Cumming
murrayc murrayc com
www.murrayc.com
www.openismus.com



