Unicode and C++
- From: Nathan Myers <ncm@cantrip.org>
- To: hp@redhat.com
- Cc: libstdc++@sourceware.cygnus.com, gtk-i18n-list@gnome.org, otaylor@redhat.com
- Subject: Unicode and C++
- Date: Sat, 1 Jul 2000 01:26:56 -0700
Havoc Pennington wrote:
> Anyway, basically I need to convert to and from UTF-8 a
> _lot_ if it isn't used natively. Typically we have:
>
> class Label
> {
> public:
>   utf_string text () const;
>   void set_text (const utf_string & str);
> };
>
> utf_string
> Label::text () const
> {
>   // gtk_label()->text is a nul-terminated UTF-8 string
>   return utf_string (gtk_label ()->text);
> }
>
> void
> Label::set_text (const utf_string & str)
> {
>   // c_str() should return UTF-8
>   gtk_label_set_text (str.c_str ());
> }
>
> Converting to/from wide chars seems quite slow and wasteful; the
> emerging free software standard is UTF-8 (Pango, Python, Perl, GTK,
> glibc all mostly use it, I think...)
The C++ committee deliberately chose to design its Standard Library so
that large characters are preferentially stored and operated on as wide
characters, and streamed in and out of the system in an 8-bit encoding
where appropriate, converting automatically at the buffer level.
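The buffer-level conversion is the job of the codecvt facet of whatever
locale is imbued on a wide stream. A minimal sketch; which 8-bit
encoding comes out depends on the user's environment:

    #include <fstream>
    #include <locale>

    int main()
    {
        std::wofstream out;
        out.imbue(std::locale(""));   // environment chooses the codecvt facet
        out.open("greeting.txt");
        // wchar_t in memory; the stream buffer narrows on the way out
        out << L"wide in memory, 8-bit on disk\n";
        return 0;
    }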
Manipulating UTF-8 in memory is pathetic. UTF-8 is compact and
convenient as a network and file format representation, but it sucks
rocks for string manipulation, or in general for in-memory operations.
Things that are naturally O(1), such as indexing the i'th character,
become O(n) for no reason better than sheer obstinacy.
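To see the cost concretely: finding the i'th character of a wchar_t[]
is one subscript, but in UTF-8 it is a scan from the beginning. A
sketch, with a made-up helper name:

    #include <cstddef>

    // utf8_index is hypothetical, not from any library: return the i'th
    // character of a nul-terminated UTF-8 string, or 0 if there is none.
    // Every lookup is an O(n) scan; wchar_t[] answers in O(1).
    char const*
    utf8_index(char const* s, std::size_t i)
    {
        for (; *s; ++s)
            if ((*s & 0xC0) != 0x80 && i-- == 0)  // lead byte starts a character
                return s;
        return 0;
    }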
How can we salvage something from the mess? For people who insist on
keeping UTF-8 in RAM and passing nul-terminated UTF-8 strings around,
we can do several things. The best approaches avoid institutionalizing
UTF-8 except as a low-level interchange format.
Ideally, we would plan to add wide-character interfaces to the
GTK/GNOME components. A new-generation component system does nobody
any favors by forcing them to stick with using 8-bit chars to hold
things that are intrinsically bigger. Whatever we do should be able,
more or less automatically, to take advantage of wide-character
interfaces in GNOME as they are implemented. (It's disgraceful that
it's not the default already.)
> So, some possible solutions:
> a) convert to/from a string of 16-bit UCS2 wide chars
16-bit UCS2 is a crock: it cannot represent the ISO 10646 code points
beyond U+FFFF, except by falling back on surrogate pairs, which is
variable-width encoding all over again. Microsoft and Sun/Java were
idiotic to have fallen for this half measure. We in the Free Software
world needn't be so foolish. On all interesting architectures and
environments, wchar_t is 32 bits by default, and that is a sensible
size for a modern character.
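A trivial check makes the size question concrete:

    #include <cstdio>

    int main()
    {
        // 0x10000 is the first code point beyond 16-bit reach; with a
        // 32-bit wchar_t it is just another character value.
        wchar_t beyond_bmp = 0x10000;
        std::printf("wchar_t: %u bits\n", (unsigned)(sizeof(wchar_t) * 8));
        return beyond_bmp == 0x10000 ? 0 : 1;
    }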
> b) just use std::string, and require users to handle
> iteration by character on their own (standard library
> operations would operate on bytes)
That is the status quo. It's unsatisfactory, or we wouldn't
be discussing this.
> c) subclass std::string and add methods and an iterator
> type that operate on characters, leaving the default
> set of methods and iterators operating on bytes
std::string is a "concrete" class: it has no virtual destructor, and
none of its members is virtual, so it does not support derivation
safely or well. Fortunately, there is little reason to do such a thing.
> d) write a class with the same interface as a) (wide chars)
> but using an internal UTF-8 string as the representation,
> in order to have a fast .c_str() that returns UTF-8
This would be unsatisfactory in any number of ways; to begin with,
every character-indexed operation would turn into an O(n) walk over
the UTF-8 bytes.
All is not lost, though. As I noted, there are other approaches.
> Some pros and cons from my perspective:
> a) + the string has the proper complexity guarantees for algorithms
> + is easy to implement using existing libstdc++ string code
> I assume
> + this is probably the standard thing on Windows
>
> - expensive, due to copies to/from utf8
> - difficult to interoperate with all the C code using UTF-8 or
> normal strings; unclear if .c_str() even makes sense
> - possible that we need UCS4, ugh (I don't know though)
Agreed on these, except that UCS4 is the right thing.
> b) + simple
> - sucks
Agreed.
> c) + allows easy interoperation with both C code and old C++ code
> that uses std::string
> + can add methods that maybe make sense for UTF-8 but aren't
> logical for a fixed-size-character string
>
> - you have a bunch of renamed stuff, like ::char_iterator,
> char_find, char_find_first_of, char_rfind, etc.
> - operator[] operates on bytes
> - I'm not sure the implementation works out properly in practice
> - assignment is O(n), for example - complexity guarantees broken
Agreed. Derivation is an answer to the wrong question.
> d) + an efficiency hack that maybe solves the .c_str() problem
>
> - it damages efficiency for e.g. operator[], so the efficiency
> hack is a tradeoff rather than a win
> - still interoperates poorly with old C++ code using std::string
>
> So, I dunno. The above notes are made up on-the-fly, I'm sure they
> aren't comprehensive. Discuss. ;-)
Right. There are better alternatives.
The first helpful thing to observe is that there is no reason that
functions which operate on strings (of whatever sort) have to be
members of a class. In fact, we would have been better off if all
the members of class string that _could_ be non-members _were_.
(The enormous list of string members is the biggest design error in
the standard string class.)
If you think of std::string as just a transport mechanism, there is
no reason to change it for use in transporting UTF-8; just ignore
the members that assume a character is one byte. To operate on a
string that you happen to know contains UTF-8 characters, just
write and call functions that assume it contains UTF-8 sequences.
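For instance, here is one such non-member function; the name is made
up, and it assumes the string really does hold well-formed UTF-8:

    #include <cstddef>
    #include <string>

    // Count the characters in a std::string known to carry UTF-8.
    std::size_t
    utf8_char_count(std::string const& s)
    {
        std::size_t n = 0;
        for (std::string::const_iterator p = s.begin(); p != s.end(); ++p)
            if ((*p & 0xC0) != 0x80)  // only lead bytes begin a character
                ++n;
        return n;
    }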
For cases where you want an efficient addressable container object
(e.g. for operator[]()), you can make an object that keeps both
representations. Flags record whether the char[] or the wchar_t[] form
is currently valid; a mutative operation on one marks the other stale,
to be regenerated lazily the next time it is wanted. Then conversions
happen invisibly and only as necessary.
The following is just a sketch.
#include <cstddef>  // std::size_t
#include <string>   // std::string, std::wstring

class Unicode_string
{
public:
    // constructors
    explicit Unicode_string(char const* p)
      : narrow(p), wide(), flags(narrow_ok) {}
    explicit Unicode_string(std::string const& s)
      : narrow(s), wide(), flags(narrow_ok) {}
    explicit Unicode_string(std::wstring const& s)
      : narrow(), wide(s), flags(wide_ok) {}

    // conversions
    operator std::wstring const&() const
      { this->widen(); return this->wide; }
    operator std::string const&() const
      { this->narrowen(); return this->narrow; }
    char const* c_str() const
      { this->narrowen(); return this->narrow.c_str(); }

    // utility operations
    wchar_t& operator[](std::size_t i)  // mutation: narrow form goes stale
      { this->widen(); this->flags &= ~narrow_ok; return this->wide[i]; }
    wchar_t const& operator[](std::size_t i) const
      { this->widen(); return this->wide[i]; }
    bool equal(Unicode_string const& s) const;
    bool less(Unicode_string const& s) const;

private:
    void widen() const
      { if (!(this->flags & wide_ok)) this->make_wide(); }
    void narrowen() const
      { if (!(this->flags & narrow_ok)) this->make_narrow(); }
    void make_wide() const;    // do UTF-8 to UCS4 conversion; sets wide_ok
    void make_narrow() const;  // do UCS4 to UTF-8 conversion; sets narrow_ok

    // data members: mutable, so a const object can still fill in
    // whichever representation it is missing
    mutable std::string narrow;
    mutable std::wstring wide;
    enum { neither_ok = 0, narrow_ok = 0x1, wide_ok = 0x2, both_ok = 0x3 };
    mutable char flags;
};

inline bool
operator==(Unicode_string const& a, Unicode_string const& b)
  { return a.equal(b); }

bool
Unicode_string::equal(Unicode_string const& s) const
{
    switch (this->flags & s.flags)
    {
    case neither_ok:  // no form in common: widen both and compare
        return (std::wstring const&)(*this) == (std::wstring const&)(s);
    case narrow_ok:
        return this->narrow == s.narrow;
    case wide_ok:
    case both_ok:
    default:
        return this->wide == s.wide;
    }
}
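To make the laziness visible, a possible use; set_label_text here is a
made-up stand-in for any C interface that takes UTF-8:

    extern void set_label_text(char const* utf8);  // hypothetical

    void
    example()
    {
        Unicode_string s("voil\xC3\xA0");  // built narrow; no conversion yet
        set_label_text(s.c_str());         // narrow form valid: no work done
        s[0] = L'V';                       // first wide use converts, and
                                           // marks the narrow form stale
        set_label_text(s.c_str());         // narrow regenerated, once
    }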
The advantage of this approach is that when Pango implements
wide-character interfaces (as it had better, someday soon) you don't
have to change your code much. If you never operate on the string's
individual characters, or otherwise treat it as a wide string, the
conversion never happens. Likewise, if you construct it as a wide
string and never treat it as bytes, that conversion never happens.
Nathan Myers
ncm at cantrip dot org