Re: Glib::ustring's operator<< doing a conversion to locale, why?



On Fri, 2007-05-04 at 21:50 +0100, Chris Vine wrote:
> On Fri, 2007-05-04 at 12:41 +0200, Murray Cumming wrote:
> > On Fri, 2007-05-04 at 12:26 +0200, Milosz Derezynski wrote:
> > > Hey Murray,
> > > 
> > > Well the thing is that with UTF-8, you basically remain C string
> > > compatible (there are no zero/terminators in the middle of the
> > > string), which essentially means that you only have a sequence of
> > > bytes and a terminating zero, which can always and at any time be
> > > stuffed into an std::string (not wstring btw, and this is not the same
> > > problem as "C++ doesn't provide any kind of UTF-8 string data type"),
> > > and get the same sequence back out. 
> > > 
> > > Since you can store a C string into an std::string, and get the exact
> > > same result back out using std::string::c_str(), you can store a
> > > UTF-8 string unchanged into an std::string (or pipe into an ostream,
> > > etc).
> > 
> > I guess that ostream has different constraints than std::string. For
> > instance, ostream can convert numbers to text representations and
> > vice-versa and do formatting. It can't do that for UTF-8 strings,
> > because there is no UTF-8 support in standard C++.
> > 
> > I don't know whether this is the main problem with ostream and UTF-8.
> > 
> > If you are sure that the use of ostream should work then I guess you can
> > just give the result of ustring::raw() to the ostream, as suggested
> > in the documentation.
> 
> The use of ustring::raw() to suppress codeset conversion by the
> insertion and extraction operators will definitely work.  In the absence
> of the codeset facet being set, the only conversion the ostream is
> entitled to make (and then only if the binary flag is not set) is of
> the end-of-line marker ('\n'), which is not UTF-8 dependent.
> 
> The only problem with using ostreams with UTF-8 is in relation to field
> width output formatting, which will be set in bytes rather than
> characters with a standard ostream imbued with the C locale.  You
> mention numbers, but they are not a problem as in ASCII/UTF-8 they have
> a single byte representation.
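
(To make the field width point concrete, here is a minimal sketch. It
assumes a plain std::ostringstream with the classic "C" locale imbued
and a hard-coded UTF-8 literal; the helper names pad_right_aligned,
utf8_length and check_field_width are made up for illustration and are
not part of glibmm.)

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Format `text` right-aligned in a field of `width`, using a plain
// std::ostringstream with the classic "C" locale imbued explicitly.
std::string pad_right_aligned(const std::string& text, int width)
{
    std::ostringstream out;
    out.imbue(std::locale::classic());
    out << std::setw(width) << text;
    return out.str();
}

// Count UTF-8 characters by skipping continuation bytes (10xxxxxx).
std::size_t utf8_length(const std::string& text)
{
    std::size_t count = 0;
    for (unsigned char c : text)
        if ((c & 0xC0) != 0x80)
            ++count;
    return count;
}

// Small self-check of the behaviour described above.
void check_field_width()
{
    // "café" in UTF-8: 5 bytes, 4 characters.
    const std::string cafe = "caf\xC3\xA9";
    const std::string padded = pad_right_aligned(cafe, 6);

    // The stream pads to 6 bytes, so only one space is added...
    assert(padded.size() == 6);
    // ...which leaves 5 visible characters, not the 6 asked for.
    assert(utf8_length(padded) == 5);
}
```

So the padding is computed from the byte count, and a column of such
fields would be misaligned on a UTF-8 terminal whenever the text
contains multi-byte characters. The bytes themselves pass through the
stream unchanged.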

English numbers maybe. Are you sure that no language uses more than one
byte in UTF-8 for any of its numbers or for decimal points or commas?
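
For what it's worth, here is a rough sketch of what I mean. It only
shows two things I'm fairly sure of: a stream with the classic "C"
locale formats an int as single-byte ASCII digits, and a digit such as
U+0663 (ARABIC-INDIC DIGIT THREE) needs two bytes in UTF-8. Whether any
particular locale's number formatting would actually emit such digits
is not something this shows. The names format_classic and
arabic_indic_three are just for illustration.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Format an int with a stream that has the classic "C" locale imbued.
std::string format_classic(int value)
{
    std::ostringstream out;
    out.imbue(std::locale::classic());
    out << value;
    return out.str();
}

// U+0663 ARABIC-INDIC DIGIT THREE, written out as its UTF-8 bytes.
const std::string arabic_indic_three = "\xD9\xA3";

// Compare the byte counts of the two representations.
void check_digit_sizes()
{
    // Four ASCII digits, one byte each.
    assert(format_classic(2007).size() == 4);
    // One non-ASCII digit, two bytes.
    assert(arabic_indic_three.size() == 2);
}
```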

But again, I don't think this is the main issue. There's some more
obvious error that this is meant to prevent. I'm not sure what it is.

> Codeset conversion should really be left to codeset facets rather than
> the extraction and insertion operators, which are ignorant of the state
> of the binary flag and whatever locales and facets have been imbued into
> the stream.  They ignore one of the features of C++ streams which
> differentiate them from C streams, which is that different stream
> objects can have different locales imbued in them.
> 
> I think I remember someone (Daniel Elstner?) mentioning that they seemed
> like a good idea at the time but turned out to be a mistake which it is
> too late to change.

He's always seemed very sure about this being correct, I think.

-- 
Murray Cumming
murrayc murrayc com
www.murrayc.com
www.openismus.com



