Re: [Nautilus-list] Re: About wchar support.



Darin Adler <darin@bentspoon.com> writes:
> 
> You're lecturing to the wrong guy. The decision to use UTF-8 for GNOME
> 2 is and was made by the GTK team, who hang out on the gtk-devel-list.
> 
> If there are particular places where the conversion between UTF-8 and
> a uniform-width character encoding turns out to be a bottleneck, we
> can consider changing the program to keep things in the uniform-width
> coding and convert to UTF-8 only when needed. But I'd need actual
> performance data to make that change, not merely the theoretical idea
> that it would be faster. In anything but the most text-intensive
> programs I suspect the real bottlenecks would be elsewhere.
> 

Just to explain the decision to James.

The reason for using UTF-8 is that otherwise we would have had to
duplicate the entire library API (thousands of functions) with _wc
variants: gtk_label_set_text() alongside gtk_label_set_text_wc(). Then on top
of that string literals in C would not work.  Then on top of that we
would have to port millions of lines of code to wide chars. Combine
all these negatives, and it's just not feasible in any way.  With
UTF-8, most apps will Just Work without too much porting effort.
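
For example, with UTF-8 the existing entry points keep working
unchanged; a rough sketch (assuming you already have a GtkLabel
around):

    #include <gtk/gtk.h>

    static void
    set_greeting (GtkLabel *label)
    {
      /* The one existing API takes a plain UTF-8 string; the literal
       * below works as long as the source file is saved as UTF-8. */
      gtk_label_set_text (label, "Grüße, καλημέρα");

      /* With a widechar API we'd have needed a parallel, hypothetical
       * gtk_label_set_text_wc() taking L"..." literals instead. */
    }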

The disadvantages of UTF-8 exist, but they aren't that bad. Darin
suggests that a very text-intensive app might be slow, but even Pango
which spends tons of time doing text processing is slowed down maybe
5% by UTF-8 according to profiles. (5% = time spent in UTF-8
manipulation functions.) (Owen does want to move Pango internals to
wide chars, but that's mostly because UTF-8 is sort of
annoying to iterate over when you are doing tons of text
processing. But it's a one-liner to convert from UTF-8 to wide chars
if you expect to process the text a lot, and then you convert back.)
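
To make that concrete, the round trip looks roughly like this with
GLib's conversion functions (just a sketch; the loop body stands in
for whatever heavy processing you actually do):

    #include <glib.h>

    static gchar *
    process_heavily (const gchar *utf8)
    {
      glong len = 0;
      gunichar *ucs4 = g_utf8_to_ucs4_fast (utf8, -1, &len);
      gchar *result;
      glong i;

      for (i = 0; i < len; i++)
        {
          /* ucs4[i] is a fixed-width character; index it directly. */
        }

      /* Convert back to UTF-8 when you're done. */
      result = g_ucs4_to_utf8 (ucs4, len, NULL, NULL, NULL);
      g_free (ucs4);
      return result;
    }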

Widechar Unicode is not actually much simpler than UTF-8. Initially it
appears that it allows you to assume that 1 string index is 1
char. But this isn't really helpful most of the time, once you
consider grapheme boundaries, combining characters, clusters, and all
that stuff. Initially it appears that widechar can't be invalid; but
it turns out that some integer values are invalid Unicode, and some
character sequences are invalid as well. So either way you have to
validate, and either way you can't simply assume that one array index
is one user-visible character.
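
Either way the validation step looks about the same; a quick sketch
with GLib (assuming the text came from somewhere untrusted):

    #include <glib.h>

    static gboolean
    text_is_sane (const gchar *utf8_input, gunichar wc)
    {
      /* UTF-8 isn't automatically valid: check the byte sequence. */
      if (!g_utf8_validate (utf8_input, -1, NULL))
        return FALSE;

      /* Widechar isn't automatically valid either: some integer
       * values simply aren't Unicode characters. */
      if (!g_unichar_validate (wc))
        return FALSE;

      return TRUE;
    }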

And while UTF-8 initially sounds hard to deal with in string
algorithms, it turns out that nearly any string processing anyone does
either a) is totally broken from an i18n standpoint or b) assumes the
text is ASCII. So Pango-like algorithms are one of the few really
legitimate instances of Unicode string processing.

The C++ world likes wide chars because the std::basic_string
interface works on them. But again, nearly any use of std::basic_string I've seen
either assumes ASCII or is broken for Unicode. And any use of
iostreams for text formatting is also assuming ASCII or broken.

Finally, all our existing legacy codebase is either broken or makes no
assumptions about encoding whatsoever, because currently strings can
be in any one of countless encodings. So UTF-8 actually makes it
easier to handle strings, not harder. Now you can at least iterate
over strings by char since you do know the encoding. Before you had no
idea what the encoding was.
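
For instance, once you know the string is UTF-8 you can walk it a
character at a time with the usual GLib idiom (a sketch; counting
letters is just an arbitrary example):

    #include <glib.h>

    static glong
    count_letters (const gchar *utf8)
    {
      const gchar *p;
      glong n = 0;

      /* Iterate by character, not by byte. With an unknown legacy
       * encoding you couldn't even do this much reliably. */
      for (p = utf8; *p != '\0'; p = g_utf8_next_char (p))
        {
          if (g_unichar_isalpha (g_utf8_get_char (p)))
            n++;
        }

      return n;
    }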

Havoc







