Re: Glib::ustring's operator<< doing a conversion to locale, why?



On Fri, 2007-05-04 at 12:41 +0200, Murray Cumming wrote:
> On Fri, 2007-05-04 at 12:26 +0200, Milosz Derezynski wrote:
> > Hey Murray,
> > 
> > Well the thing is that with UTF-8, you basically remain C string
> > compatible (there are no zero/terminators in the middle of the
> > string), which essentially means that you only have a sequence of
> > bytes and a terminating zero, which can always and at any time be
> > stuffed into an std::string (not wstring btw, and this is not the same
> > problem as "C++ doesn't provide any kind of UTF-8 string data type"),
> > and get the same sequence back out. 
> > 
> > Since you can store a C string into an std::string, and get the exact
> > same result back out using std::string::c_str(), you can store a
> > UTF-8 string unchanged into an std::string (or pipe into an ostream,
> > etc).
> 
> I guess that ostream has different constraints than std::string. For
> instance, ostream can convert numbers to text representations and
> vice-versa and do formatting. It can't do that for UTF-8 strings,
> because there is no UTF-8 support in standard C++.
> 
> I don't know whether this is the main problem with ostream and UTF-8.
> 
> If you are sure that the use of ostream should work then I guess you can
> just give the result of ustring::raw() to the ostream, as suggested
> in the documentation.

Using ustring::raw() to suppress codeset conversion by the insertion
and extraction operators will definitely work.  In the absence of a
codeset facet being set, the only conversion the ostream is entitled
to make (and then only if the binary flag is not set) is that of the
end-of-line marker ('\n'), which is not UTF-8 dependent.
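
To illustrate the point about bytes passing through untouched, here is
a minimal sketch.  It uses a plain std::string as a stand-in for what
ustring::raw() returns, so that it builds without glibmm; the function
name write_raw_utf8 is made up for this example.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: utf8_bytes stands in for the std::string that
// Glib::ustring::raw() would hand back (UTF-8 bytes, no conversion).
std::string write_raw_utf8(const std::string& utf8_bytes)
{
    std::ostringstream out;   // no facet imbued beyond the default locale
    out << utf8_bytes;        // the std::string inserter copies the bytes as they are
    return out.str();         // same byte sequence that went in
}
```

Calling write_raw_utf8() with a UTF-8 encoded string should give back
an identical byte sequence.  With a real ustring one would presumably
write out << s.raw() instead.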

The only problem with using ostreams with UTF-8 relates to field-width
output formatting: a standard ostream imbued with the C locale counts
the field width in bytes rather than characters.  You mention numbers,
but they are not a problem, as in ASCII/UTF-8 each digit has a
single-byte representation.
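
A small sketch of the field-width issue, again with std::string
standing in for the raw UTF-8 bytes.  The helper names pad_left and
utf8_length are invented for this example, and utf8_length assumes
well-formed UTF-8.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Right-aligns s in a field of the given width, as the stream sees it.
std::string pad_left(const std::string& s, int width)
{
    std::ostringstream out;
    out << std::setw(width) << s;   // width is measured in chars, i.e. bytes
    return out.str();
}

// Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
std::size_t utf8_length(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}
```

With a width of 5, a one-character string whose character takes two
bytes in UTF-8 comes out five bytes long but only four characters
wide, whereas a plain ASCII character is padded to five characters as
expected.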

Codeset conversion should really be left to codeset facets rather than
the extraction and insertion operators, which are ignorant of the state
of the binary flag and of whatever locales and facets have been imbued
into the stream.  The operators ignore one of the features of C++
streams that differentiates them from C streams, namely that different
stream objects can have different locales imbued in them.
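
As a rough illustration of per-stream locales, the sketch below imbues
two streams differently.  To keep it short and independent of which
named locales are installed, it uses a numeric-punctuation facet rather
than a codeset facet; comma_decimal and format_with are names made up
for this example.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: prints ',' as the decimal point.
struct comma_decimal : std::numpunct<char> {
    char do_decimal_point() const override { return ','; }
};

// Formats value on a stream imbued with loc; other streams are unaffected.
std::string format_with(const std::locale& loc, double value)
{
    std::ostringstream out;
    out.imbue(loc);      // the locale belongs to this stream object only
    out << value;
    return out.str();
}
```

Passing std::locale::classic() and a locale built as
std::locale(std::locale::classic(), new comma_decimal) to format_with()
should give different text for the same value, which is the kind of
per-stream behaviour that operators doing their own conversion cannot
take into account.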

I think I remember someone (Daniel Elstner?) mentioning that the
converting operators seemed like a good idea at the time but turned out
to be a mistake which it is too late to change.

Chris




