Re: Unicode and C++



Nathan Myers <ncm@cantrip.org> writes:

> The C++ committee deliberately chose to design its Standard Library so 
> that large characters are preferentially stored and operated on as wide
> characters, and streamed in and out of the system in an 8 bit encoding
> where appropriate, converting automatically at the buffer level.
> 
> Manipulating UTF-8 in memory is pathetic.  UTF-8 is compact and 
> convenient as a network and file format representation, but it sucks 
> rocks for string manipulations, or in general for in-memory operations.  
> Things that are naturally O(1) become O(n) for no reason better than 
> sheer obstinacy and stubbornness.

Almost no text algorithms require random access within strings; I've
done quite a bit of code review to see how people use strings and how
the code will have to be converted to UTF-8, and this is simply not a
problem.

People almost never care about character 20. And if they think that
character 20 in the string is going to correspond to a field width of
20 on the screen, then they are badly mistaken.

I can't say it _never_ matters. There is one common operation that
_does_ become O(n) - and that is converting from

 iterator within wide string 

to: 

 iterator within multibyte string

It matters if people are using wide strings and multibyte strings
simultaneously.
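
To make that concrete, here is a minimal sketch using GLib's UTF-8
functions (the string and offset are made up for illustration).
Converting a character offset to a byte position, or back, has to
walk the string from the start, which is where the O(n) comes from:

  #include <glib.h>

  int
  main (void)
  {
    const gchar *str = "a\303\251b\303\247c";   /* "aébçc" in UTF-8 */

    /* Character offset -> byte pointer: decodes every character up
     * to the offset, so it is O(n) in the offset. */
    gchar *p = g_utf8_offset_to_pointer (str, 3);

    /* Byte pointer -> character offset: the same walk in reverse. */
    glong offset = g_utf8_pointer_to_offset (str, p);

    g_print ("byte index %d, character offset %ld\n",
             (int) (p - str), offset);
    return 0;
  }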

I'm not an unreasoning bigot for UTF-8; there are other reasons why it
isn't always a great thing. (I've written quite a pile of UTF-8 code
by now, so I think I'm in a reasonable position to judge.)

 - There is an efficiency hit for working in UTF-8. Not
   so much from extra O(n) steps as simply from the
   overhead of iterating through strings. This isn't
   a huge overhead, but it probably slows down Pango
   by about 5%.

 - There is an extra conceptual load on the programmer when
   manipulating UTF-8 in C. (A compilable version of the UTF-8
   loop follows this list.)

   while (*p)
     {
       if (*p == wc)
         n++;
       p++;
     }

   Is undoubtedly a bit simpler for an experienced C programmer
   than:

   while (*p)
     {
       if (g_utf8_get_char (p) == wc)
         n++;
       p = g_utf8_next_char (p);
     }
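
To flesh out the second loop, here is a self-contained version,
assuming GLib's UTF-8 API; the function name count_char and the
test string are just illustrative:

  #include <glib.h>

  /* Count occurrences of the character 'wc' in the UTF-8 string
   * 'p'; this is the loop above, wrapped up so it compiles. */
  static int
  count_char (const gchar *p, gunichar wc)
  {
    int n = 0;

    while (*p)
      {
        if (g_utf8_get_char (p) == wc)
          n++;
        p = g_utf8_next_char (p);
      }

    return n;
  }

  int
  main (void)
  {
    g_print ("%d\n", count_char ("na\303\257ve", 'a'));  /* prints 1 */
    return 0;
  }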

Inside something like Pango, both of these are something of a
concern for me, the first more than the second.

For people using gtk2, if I have any concerns at all about UTF-8, they
are _completely_ the second concern. In general, keeping a toolkit
easy to use and the code people write maintainable is much more
important than any 5% performance differences.

From this perspective, using UTF-8 is a mixed blessing.

 - It complicates string handling a bit, as described above.
   But most code that dives into strings that way is "broken" in
   other ways for internationalization.

However:

 - Most code translates directly to UTF-8 with no changes at
   all. (Most code does not dive into strings, or if it does, does so
   in ways that are safe for UTF-8. Think about how much code written
   for iso-8859-1 is used for EUC-JP, without bad effects.)
   This is a lot better than changing every function to have _w after
   it.

   Of course, IIRC, Microsoft simply changed their functions
   to have different interfaces when you #define UNICODE. 
   I hope nobody thinks that is a good idea.

 - Relying on the C library to have wide character equivalents to
   things like sprintf() is just not going to work for now.  GNU
   libc-2.2 will have such functions, but right now, the fraction of
   the GTK+ user base on platforms with such functions is tiny. Not
   to mention the fact that wchar_t is 16 bits in quite a few
   places... So UTF-8 is going to be nicer for people trying to use
   the system's features, including gettext(). (A quick sketch of
   using gettext() with UTF-8 follows this list.)
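
On the gettext() point: with a new-enough GNU gettext (the one
shipped with glibc 2.2, for instance), you can ask for translations
back in UTF-8 regardless of the locale's encoding, via
bind_textdomain_codeset(). A rough sketch - the domain name and
locale directory here are just placeholders:

  #include <libintl.h>
  #include <locale.h>
  #include <stdio.h>

  #define _(String) gettext (String)

  int
  main (void)
  {
    setlocale (LC_ALL, "");

    /* "myapp" and the locale directory are placeholders. */
    bindtextdomain ("myapp", "/usr/share/locale");

    /* Ask GNU gettext to return translations in UTF-8, whatever
     * encoding the message catalog or the locale happens to use. */
    bind_textdomain_codeset ("myapp", "UTF-8");

    textdomain ("myapp");

    printf ("%s\n", _("Hello, world"));
    return 0;
  }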

To me, making UTF-8 the standard string type for GTK+ is
almost incontestably a good thing. Anything else is going
to be a nightmare for porting to gtk2 and for portability.
 
> How can we salvage something from the mess?  For people who insist
> on keeping UTF-8 in RAM and passing null-terminated strings of them
> around, we can do several things.  The best approaches avoid
> institutionalizing UTF-8 except as a low-level interchange format.
> 
> Ideally, we would plan to add wide-character interfaces to the 
> GTK/GNOME components.  A new-generation component system does nobody 
> any favors by forcing them to stick with using 8-bit chars to hold 
> things that are intrinsically bigger.  Whatever we do should be able, 
> more or less automatically, to take advantage of wide-character
> interfaces in GNOME as they are implemented.  (It's disgraceful that 
> it's not the default already.)

I don't think doubling all text entry points in GTK+ would be a good
thing. In almost all cases, the conversion overhead simply isn't
significant for a UI library, so we might as well keep things simple
and not add a slew of extra entry points.

It's conceivable that in the few cases where conversion could
_possibly_ be an overhead, like the Text widget, we might want to
have dual interfaces. But even there, if you want to do that and
keep the code simple, you are just moving the conversion overhead
inside of GTK+.

Now, Pango is a somewhat different issue. It has fewer entry points
that take text, and it handles text in a much more intensive way. At
this point I could see some merit in making Pango use UCS-4
internally, and providing dual entry points. (I'd like to see Pango
used outside of GTK+, and providing UCS-4 interfaces along with the
UTF-8 ones might help in this. Not that converting between
UCS-4 and UTF-16 with surrogates is significantly nicer than
converting between UTF-8 and UTF-16 with surrogates. In either case,
converting an index is an O(n) operation.)
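
If Pango did grow UCS-4 entry points, the UTF-8 variants could
presumably be thin wrappers that convert on the way in. A hedged
sketch of what I mean, using GLib's conversion functions -
do_something_ucs4() and do_something_utf8() are made-up stand-ins,
not real Pango API:

  #include <glib.h>

  /* Hypothetical UCS-4 entry point; a stand-in for whatever the
   * real Pango call would look like. */
  static void
  do_something_ucs4 (const gunichar *text, glong n_chars)
  {
    g_print ("%ld characters, first is U+%04X\n",
             n_chars, (guint) text[0]);
  }

  /* The UTF-8 variant is then just a conversion wrapper; the
   * conversion itself (and any index translation) is O(n). */
  static void
  do_something_utf8 (const gchar *text, glong len)
  {
    glong n_chars;
    gunichar *ucs4 = g_utf8_to_ucs4_fast (text, len, &n_chars);

    do_something_ucs4 (ucs4, n_chars);
    g_free (ucs4);
  }

  int
  main (void)
  {
    do_something_utf8 ("P\303\245ngo", -1);  /* "Pångo", nul-terminated */
    return 0;
  }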

I very much doubt I'll have a chance to do this for Pango-1.0. 

Regards,
                                        Owen



