Re: Glib::ustring's operator<< doing a conversion to locale, why?



Hey _again_ (yes again, sorry),

We're currently working on (and testing) a class derived from ustring to work around this issue. We know ustring is intended as a final class, but we see no reasonable way to fix this short of taping Post-Its saying "use ::raw()!" to our screens. While doing that, we came up with an idea.

Basically: an additional member function of ustring, ::set_output_type() or ::set_flags(), that modifies a static member variable (so it would apply to all ustrings in use) and makes it possible to switch operator<<() to a different behaviour (in our case the one mentioned above, to just use os << raw()). The overhead would be one static member variable and one switch() in operator<<().

Would this be an acceptable code change?

-- Milosz

On 5/5/07, Daniel Elstner <daniel kitta googlemail com > wrote:
On Saturday, 05.05.2007 at 17:04 +0200, Milosz Derezynski wrote:
> A small follow-up to the previous mail wrt "i see no reason": An
> std::string can hold text in a different encoding than the current locale
> as well. ustring makes the assumption that since it always holds
> UTF-8, and i think can not even sanely hold anything else (save for
> valid subsets of UTF-8), it is fully rational to perform this
> conversion in the operators, but the part that bugs me is that it uses
> LANG or LC_* or LOCALE (etc, as stated) as basis for the conversion,
> for which there is no reason to believe that people actually always
> want this.
>
> If it would use the current global C++ locale (if there can be a
> global locale setting, sorry for my newbishness again), it would be
> all right really, but this way, it's just beyond odd.
>
> Daniel can you maybe shed some light on this please?

Yes, I agree that it was basically a mistake to make operator<<()
convert to the locale encoding.  I implemented this at a time when GCC's
libstdc++ didn't support the C++ locale scheme and the global C locale
was always used.  Now I find myself writing .raw() all the time.

As you write above, doing the conversion is not entirely unreasonable
though, since ustring always uses UTF-8 and C++ streams may use
different encodings.  The problem is, though, that the intended
facilities for stream codeset conversion -- that is, codecvt -- are next
to useless.  There's scarce documentation on the subject but from what I
gathered there's no public interface to get the name of the encoding
used by a stream.

It is quite obvious that the C++ standard library API simply wasn't
designed for using multi-byte encodings internally.  The code conversion
facilities of streams seem to exist mainly for conversion between wide
characters (internal) and multi-byte (external) when using wide streams.

By the way, my patch adding the compose() and format() features to
glibmm also introduces operator<<() and operator>>() conversions to
std::wostream and from std::wistream, respectively:

     http://bugzilla.gnome.org/show_bug.cgi?id=399216

These conversions are actually sensible to do, and even independent of
the locale on many (most?) systems -- at least on modern glibc systems
(always UCS-4) and Windows (always UTF-16).
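
To illustrate why such a conversion does not need to consult the locale: decoding UTF-8 into code points is plain bit arithmetic. The following is only a minimal sketch, not the code from the patch. The helper name utf8_to_ucs4() is made up, it targets char32_t rather than wchar_t so that it behaves the same everywhere, and it does not reject overlong forms or surrogates:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UCS-4 code points.
// Malformed input yields U+FFFD (replacement character).
std::u32string utf8_to_ucs4(const std::string& in) {
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;

    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;

        // The lead byte determines the sequence length and payload bits.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else { out += replacement; ++i; continue; }  // stray byte

        if (i + len > in.size()) {  // truncated sequence at end of input
            out += replacement;
            break;
        }

        // Each continuation byte (10xxxxxx) contributes six more bits.
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out += replacement; ++i; continue; }

        out += cp;
        i += len;
    }
    return out;
}
```

On a system where wchar_t holds UCS-4, the result could be copied element by element into a std::wstring and written to a std::wostream; on Windows an additional UTF-16 encoding step would be needed.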

> > I think I remember someone (Daniel Elstner?) mentioning that they seemed
> > like a good idea at the time but turned out to be a mistake which it is
> > too late to change.

Indeed.  I think I said this in some bugzilla comment.

--Daniel




