Re: Unicode and C++
- From: Nathan Myers <ncm nospam cantrip org>
- To: gtk-i18n-list gnome org, libstdc++ sourceware cygnus com
- Subject: Re: Unicode and C++
- Date: Mon, 3 Jul 2000 19:39:35 -0700

On Mon, Jul 03, 2000 at 11:59:24AM -0400, Havoc Pennington wrote:
> Nathan Myers <ncm@cantrip.org> writes:
> > Manipulating UTF-8 in memory is pathetic. UTF-8 is compact and
> > convenient as a network and file format representation, but it sucks
> > rocks for string manipulations, or in general for in-memory operations.
> > Things that are naturally O(1) become O(n) for no reason better than
> > sheer obstinacy and stubbornness.
>
> Or in the GTK+ case, massive quantities of legacy code that has to
> keep working. UTF8 is pretty easy to port to; UCS4 requires
> duplicating the whole API, then porting all apps to it. Without the
> nice C++ trick you've outlined here, it's also quite inefficient to
> use UCS4 internally but UTF8 in the interfaces.

The proper place for the C++ trick (or, rather, its C substitute) is in
the GTK+/GNOME library, so that legacy user code can pass in UTF-8 where
it must, and modern code can use a modern interface. Libraries should
optimize for the modern case.

> > Ideally, we would plan to add wide-character interfaces to the
> > GTK/GNOME components. A new-generation component system does nobody
> > any favors by forcing them to stick with using 8-bit chars to hold
> > things that are intrinsically bigger.
>
> Sadly (well, partially sadly), GTK+ isn't new generation, it already
> supports millions of lines of code.

It's a new generation nonetheless. It's not too late to do the wise,
forward-looking thing. The first step is to acknowledge that it's
the right thing, and announce that patches are welcome. The second
is to define backward-compatibility hacks to use in the patches to
make it convenient to continue to support the old interface.

> My Inti C++ wrapper is new generation however, so I can use your
> suggestion.

I will be happy to work with you on it.

> > For cases where you want an efficient addressable container object
> > (e.g. for operator[]()), you can make an object that keeps both
> > representations. Flags indicate that the char[] or wchar_t[] form
> > has been invalidated, and must be (lazily) regenerated after mutative
> > operations on the other form. Then conversions happen invisibly and
> > only as necessary.
>
> Excellent, this is the perfect solution.
>
> > The following is just a sketch.
> >
> >   class Unicode_string
> >   {
> >     // constructors
> >     explicit Unicode_string(char const* p)
> >       : narrow(p), wide(), flags(narrow_ok) {}
> >     explicit Unicode_string(std::string const& s)
> >       : narrow(s), wide(), flags(narrow_ok) {}
> >     explicit Unicode_string(std::wstring const& s)
> >       : narrow(), wide(s), flags(wide_ok) {}
>
> If this string goes in libstdc++ as an extension, could it share the
> refcounted guts of std::string and std::wstring to avoid copies for
> these constructors (and for the conversion operators)?
> (I don't even know if you are using refcounting in the latest lib, but
> thought I'd ask.)

Since it stores "copies" of string and wstring instances, which are
refcounted, the reference-counting comes for free. Incidentally, this
is one of the few classes I know of where all the data members should
be marked "mutable". Probably it should be called "Utf8_string" rather
than "Unicode_string", because it's really specialized for conversions
to and from UTF-8, and not particularly for Unicode itself.
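
To make the "mutable" remark concrete, here is a minimal sketch of how
the lazy two-form class might look. It is an illustration under stated
assumptions, not the code from the original sketch: it assumes wchar_t
is wide enough to hold a full code point (true with glibc, not with
16-bit wchar_t), it assumes well-formed UTF-8 and does no validation,
it uses two bool members in place of the single "flags" field, and the
member names utf8(), wide(), set(), encode() and decode() are
hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical sketch: keeps a UTF-8 form and a wide form, regenerating
// whichever one is stale only when it is asked for.
class Utf8_string {
public:
    explicit Utf8_string(char const* p)
        : narrow_(p), wide_(), narrow_ok_(true), wide_ok_(false) {}
    explicit Utf8_string(std::string const& s)
        : narrow_(s), wide_(), narrow_ok_(true), wide_ok_(false) {}
    explicit Utf8_string(std::wstring const& s)
        : narrow_(), wide_(s), narrow_ok_(false), wide_ok_(true) {}

    // Both accessors are const, yet may rebuild a cache: hence "mutable".
    std::string const& utf8() const {
        if (!narrow_ok_) { narrow_ = encode(wide_); narrow_ok_ = true; }
        return narrow_;
    }
    std::wstring const& wide() const {
        if (!wide_ok_) { wide_ = decode(narrow_); wide_ok_ = true; }
        return wide_;
    }

    // Indexing by code point is O(1) once the wide form exists.
    wchar_t operator[](std::size_t i) const { return wide()[i]; }

    // Mutating through the wide form invalidates the UTF-8 form.
    void set(std::size_t i, wchar_t c) {
        wide();              // make sure wide_ is current
        wide_[i] = c;
        narrow_ok_ = false;  // regenerated lazily by utf8()
    }

private:
    // Assumption: every element of w is a valid code point.
    static std::string encode(std::wstring const& w) {
        std::string out;
        for (wchar_t wc : w) {
            unsigned long c = static_cast<unsigned long>(wc);
            if (c < 0x80) {
                out += static_cast<char>(c);
            } else if (c < 0x800) {
                out += static_cast<char>(0xC0 | (c >> 6));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else if (c < 0x10000) {
                out += static_cast<char>(0xE0 | (c >> 12));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (c >> 18));
                out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            }
        }
        return out;
    }

    // Assumption: s is well-formed UTF-8; malformed input is not detected.
    static std::wstring decode(std::string const& s) {
        std::wstring out;
        std::size_t i = 0;
        while (i < s.size()) {
            unsigned char b = static_cast<unsigned char>(s[i++]);
            unsigned long c;
            int extra;  // number of continuation bytes that follow
            if (b < 0x80)      { c = b;        extra = 0; }
            else if (b < 0xE0) { c = b & 0x1F; extra = 1; }
            else if (b < 0xF0) { c = b & 0x0F; extra = 2; }
            else               { c = b & 0x07; extra = 3; }
            for (; extra > 0 && i < s.size(); --extra, ++i)
                c = (c << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
            out += static_cast<wchar_t>(c);
        }
        return out;
    }

    mutable std::string narrow_;
    mutable std::wstring wide_;
    mutable bool narrow_ok_;
    mutable bool wide_ok_;
};
```

Constructing from a char const* costs no conversion at all; the first
call to operator[] or wide() pays for one decode, and a later utf8()
call after set() pays for one encode. Code that only ever passes the
UTF-8 form through never touches the wide form.
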
Nathan Myers
ncm at cantrip dot org