Re: Unicode and C++



On Mon, Jul 03, 2000 at 11:59:24AM -0400, Havoc Pennington wrote:
> Nathan Myers <ncm@cantrip.org> writes: 
> > Manipulating UTF-8 in memory is pathetic.  UTF-8 is compact and 
> > convenient as a network and file format representation, but it sucks 
> > rocks for string manipulations, or in general for in-memory operations.  
> > Things that are naturally O(1) become O(n) for no reason better than 
> > sheer obstinacy and stubbornness.
> 
> Or in the GTK+ case, massive quantities of legacy code that has to
> keep working.  UTF8 is pretty easy to port to; UCS4 requires
> duplicating the whole API, then porting all apps to it. Without the
> nice C++ trick you've outlined here, it's also quite inefficient to
> use UCS4 internally but UTF8 in the interfaces.

The proper place for the C++ trick (or, rather, its C substitute) is in 
the GTK+/GNOME library, so that legacy user code can pass in UTF-8 where
it must, and modern code can use a modern interface.  Libraries should
optimize for the modern case.
  
> > Ideally, we would plan to add wide-character interfaces to the 
> > GTK/GNOME components.  A new-generation component system does nobody 
> > any favors by forcing them to stick with using 8-bit chars to hold 
> > things that are intrinsically bigger.
> 
> Sadly (well, partially sadly), GTK+ isn't new generation, it already
> supports millions of lines of code.

It's a new generation nonetheless.  It's not too late to do the wise,
forward-looking thing.  The first step is to acknowledge that it's 
the right thing, and announce that patches are welcome.  The second
is to define backward-compatibility hacks to use in the patches to 
make it convenient to continue to support the old interface.
 
> My Inti C++ wrapper is new generation however, so I can use your 
> suggestion.

I will be happy to work with you on it.
  
> > For cases where you want an efficient addressable container object 
> > (e.g. for operator[]()), you can make an object that keeps both 
> > representations.  Flags indicate that the char[] or wchar_t[] form 
> > has been invalidated, and must be (lazily) regenerated after mutative 
> > operations on the other form.  Then conversions happen invisibly and 
> > only as necessary.  
> 
> Excellent, this is the perfect solution.
> 
> > The following is just a sketch.
> > 
> >   class Unicode_string
> >   {
> >     // constructors
> >     explicit Unicode_string(char const* p)
> >       : narrow(p), wide(), flags(narrow_ok) {}
> >     explicit Unicode_string(std::string const& s)
> >       : narrow(s), wide(), flags(narrow_ok) {}
> >     explicit Unicode_string(std::wstring const& s)
> >       : narrow(), wide(s), flags(wide_ok) {}
> 
> If this string goes in libstdc++ as an extension, could it share the
> refcounted guts of std::string and std::wstring to avoid copies for
> these constructors (and for the conversion operators)?
> (I don't even know if you are using refcounting in the latest lib, but
> thought I'd ask.)

Since it stores "copies" of string and wstring instances, which are
refcounted, the reference counting comes for free.  Incidentally, this
is one of the few classes I know of where all the data members should
be marked "mutable": the conversion operators are logically const, but
each may have to regenerate the other representation and update the
flags.  It should probably be called "Utf8_string" rather than
"Unicode_string", because it's really specialized for conversions to
and from UTF-8, and not particularly for Unicode itself.
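
To make the lazy-regeneration idea concrete, here is a minimal sketch,
not a proposed libstdc++ interface.  Everything in it beyond the two
stored representations and the validity flags is an assumption made for
illustration: it uses C++11 and std::u32string in place of std::wstring
(so it does not depend on the width of wchar_t), two bools in place of
a flags word, and hypothetical helper and member names (decode_utf8,
encode_utf8, utf8, ucs4, set, append_utf8).  The converters assume
well-formed UTF-8 and do no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: UTF-8 -> UCS-4.  Assumes well-formed input;
// overlong forms, surrogates and stray continuation bytes are not checked.
inline std::u32string decode_utf8(std::string const& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        std::size_t extra = c < 0x80 ? 0 : c < 0xE0 ? 1 : c < 0xF0 ? 2 : 3;
        char32_t cp = extra ? (c & (0x3F >> extra)) : c;
        for (std::size_t k = 1; k <= extra && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out += cp;
        i += extra + 1;
    }
    return out;
}

// Hypothetical helper: UCS-4 -> UTF-8.
inline std::string encode_utf8(std::u32string const& in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

class Utf8_string
{
public:
    explicit Utf8_string(std::string const& s)
      : narrow_(s), wide_(), narrow_ok_(true), wide_ok_(false) {}
    explicit Utf8_string(std::u32string const& s)
      : narrow_(), wide_(s), narrow_ok_(false), wide_ok_(true) {}

    // Logically const accessors: each regenerates its form only if stale.
    std::string const& utf8() const
    {
        if (!narrow_ok_) { narrow_ = encode_utf8(wide_); narrow_ok_ = true; }
        return narrow_;
    }
    std::u32string const& ucs4() const
    {
        if (!wide_ok_) { wide_ = decode_utf8(narrow_); wide_ok_ = true; }
        return wide_;
    }

    // O(1) once the wide form is current.
    std::size_t size() const { return ucs4().size(); }
    char32_t operator[](std::size_t i) const { return ucs4()[i]; }

    // Mutating one form invalidates the other.
    void set(std::size_t i, char32_t c)
    {
        ucs4();
        wide_[i] = c;
        narrow_ok_ = false;
    }
    void append_utf8(std::string const& s)
    {
        utf8();
        narrow_ += s;
        wide_ok_ = false;
    }

private:
    mutable std::string    narrow_;
    mutable std::u32string wide_;
    mutable bool           narrow_ok_;
    mutable bool           wide_ok_;
};
```

Legacy code would construct from a UTF-8 std::string and keep calling
utf8(); code that wants indexing pays for one decode on the first
operator[] and nothing after that until the narrow form is modified
again.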

Nathan Myers
ncm at cantrip dot org




