Unicode and C++



Havoc Pennington wrote:
> Anyway, basically I need to convert to and from UTF-8 a 
> _lot_ if it isn't used natively. Typically we have:
> 
> class Label
> {
>   utf_string text () const;
>   void set_text (const utf_string & str);
> };
> 
> utf_string
> Label::text () const
> {
>   // gtk_label()->text is a nul-terminated UTF-8 string
>   return utf_string (gtk_label->text);
> }
> 
> void
> Label::set_text (utf_string str) const
> {
>   // c_str() should return UTF-8
>   gtk_label_set_text (str.c_str ());
> }
> 
> Converting to/from wide chars seems quite slow and wasteful, the
> emerging free software standard is UTF-8 (Pango, Python, Perl, GTK,
> glibc all use it mostly I think...)

The C++ committee deliberately chose to design its Standard Library so 
that large characters are preferentially stored and operated on as wide
characters, and streamed in and out of the system in an 8 bit encoding
where appropriate, converting automatically at the buffer level.

Manipulating UTF-8 in memory is pathetic.  UTF-8 is compact and 
convenient as a network and file format representation, but it sucks 
rocks for string manipulations, or in general for in-memory operations.  
Things that are naturally O(1), such as indexing the n-th character, 
become O(n) for no better reason than sheer obstinacy.
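
To make the cost concrete, here is a minimal sketch of what it takes
to find the n-th character in a UTF-8 buffer, next to the same lookup
on a wide string.  The names utf8_offset_of and wide_at are made up
for illustration, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset of the n-th character (code point)
// in a well-formed UTF-8 string, or s.size() if the string has fewer
// than n+1 characters.  Every preceding byte is inspected, so the
// cost is O(n).
std::size_t utf8_offset_of(std::string const& s, std::size_t n)
{
  for (std::size_t i = 0; i < s.size(); ++i)
    {
      // Continuation bytes look like 10xxxxxx; any other byte starts
      // a new character.
      bool starts_char = (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
      if (starts_char)
        {
          if (n == 0)
            return i;
          --n;
        }
    }
  return s.size();
}

// The wide-character equivalent is a single O(1) index.
inline wchar_t wide_at(std::wstring const& w, std::size_t n)
{
  assert(n < w.size());
  return w[n];
}
```

The scan cannot be skipped: nothing in the buffer says where the n-th
character begins until every byte before it has been examined.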

How can we salvage something from the mess?  For people who insist on 
keeping UTF-8 in RAM and passing null-terminated strings of it around, 
we can do several things.  The best approaches avoid institutionalizing
UTF-8 except as a low-level interchange format.

Ideally, we would plan to add wide-character interfaces to the 
GTK/GNOME components.  A new-generation component system does nobody 
any favors by forcing them to stick with using 8-bit chars to hold 
things that are intrinsically bigger.  Whatever we do should be able, 
more or less automatically, to take advantage of wide-character
interfaces in GNOME as they are implemented.  (It's disgraceful that 
it's not the default already.)

> So, some possible solutions:
>  a) convert to/from a string of 16-bit UCS2 wide chars

16-bit UCS2 is a crock.  Microsoft and Sun/Java were idiotic to 
have fallen for this half measure.  We in the Free Software world 
needn't be so foolish.  On all interesting architectures and 
environments, wchar_t is 32 bits by default, and that's a sensible 
size for a modern character.

>  b) just use std::string, and require users to handle
>     iteration by character on their own (standard library
>     operations would operate on bytes)

That is the status quo.  It's unsatisfactory, or we wouldn't
be discussing this.

>  c) subclass std::string and add methods and an iterator
>     type that operate on characters, leaving the default
>     set of methods and iterators operating on bytes
 
std::string is a "concrete" class, and does not "subclass"
(i.e. support derivation from) safely or well: it has no virtual 
destructor and no virtual members to override.  Fortunately, 
there is little reason to do such a thing.
 
>  d) write a class with the same interface as a) (wide chars)
>     but using an internal UTF-8 string as the representation,
>     in order to have a fast .c_str() that returns UTF-8

This would be unsatisfactory in any number of ways.

All is not lost, though.  As I noted, there are other approaches.

> Some pros and cons from my perspective:
>  a) + the string has the proper complexity guarantees for algorithms
>     + is easy to implement using existing libstdc++ string code
>       I assume
>     + this is probably the standard thing on Windows
> 
>     - expensive, due to copies to/from utf8
>     - difficult to interoperate with all the C code using UTF-8 or
>       normal strings; unclear if .c_str() even makes sense
>     - possible that we need UCS4, ugh (I don't know though)

Agree on these, except that UCS4 is the right thing.

>  b) + simple
>     - sucks

Agreed.

>  c) + allows easy interoperation with both C code and old C++ code
>       that uses std::string
>     + can add methods that maybe make sense for UTF-8 but aren't
>       logical for a fixed-size-character string
> 
>     - you have a bunch of renamed stuff, like ::char_iterator,
>       char_find, char_find_first_of, char_rfind, etc.
>     - operator[] operates on bytes
>     - I'm not sure the implementation works out properly in practice
>     - assignment is O(n), for example - complexity guarantees broken

Agreed.  Derivation is an answer to the wrong question.

>  d) + an efficiency hack that maybe solves the .c_str() problem
> 
>     - it damages efficiency for e.g. operator[], so the efficiency
>       hack is a tradeoff rather than a win
>     - still interoperates poorly with old C++ code using std::string
> 
> So, I dunno. The above notes are made up on-the-fly, I'm sure they
> aren't comprehensive. Discuss. ;-)

Right.  There are better alternatives.

The first helpful thing to observe is that there is no reason that 
functions which operate on strings (of whatever sort) have to be 
members of a class.  In fact, we would have been better off if all 
the members of class string that _could_ be non-members _were_.  
(The enormous list of string members is the biggest design error in 
the standard string class.)

If you think of std::string as just a transport mechanism, there is 
no reason to change it for use in transporting UTF-8; just ignore 
the members that assume a character is one byte.  To operate on a 
string that you happen to know contains UTF-8 characters, just 
write and call functions that assume it contains UTF-8 sequences.
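
As a rough illustration of that style, here are two non-member
functions that treat a std::string as a carrier of UTF-8.  The names
utf8_length and utf8_next are hypothetical, not part of any library,
and both assume the string holds well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical non-member helpers; std::string is only the transport.

// Number of characters (code points), not bytes.
std::size_t utf8_length(std::string const& s)
{
  std::size_t count = 0;
  for (std::size_t i = 0; i < s.size(); ++i)
    if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
      ++count;  // every non-continuation byte starts a character
  return count;
}

// Decode the character that starts at byte offset pos, and advance
// pos past it.  pos must point at the first byte of a character.
unsigned long utf8_next(std::string const& s, std::size_t& pos)
{
  assert(pos < s.size());
  unsigned char lead = static_cast<unsigned char>(s[pos++]);
  unsigned long cp;
  int extra;  // number of continuation bytes that follow the lead
  if      (lead < 0x80) { cp = lead;        extra = 0; }
  else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
  else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
  else                  { cp = lead & 0x07; extra = 3; }
  while (extra-- > 0 && pos < s.size())
    cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
  return cp;
}
```

Callers keep passing plain std::string around, and only the code that
cares about characters calls these instead of size() and operator[].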

For cases where you want an efficient addressable container object 
(e.g. for operator[]()), you can make an object that keeps both 
representations.  Flags indicate that the char[] or wchar_t[] form 
has been invalidated, and must be (lazily) regenerated after mutative 
operations on the other form.  Then conversions happen invisibly and 
only as necessary.  

The following is just a sketch.

  class Unicode_string
  {
  public:
    // constructors
    explicit Unicode_string(char const* p)
      : narrow(p), wide(), flags(narrow_ok) {}
    explicit Unicode_string(std::string const& s)
      : narrow(s), wide(), flags(narrow_ok) {}
    explicit Unicode_string(std::wstring const& s)
      : narrow(), wide(s), flags(wide_ok) {}

    // conversions (lazy, so they are logically const)
    operator std::wstring const&() const
      { this->widen(); return this->wide; }
    operator std::string const&() const
      { this->narrowen(); return this->narrow; }
    char const* c_str() const
      { this->narrowen(); return this->narrow.c_str(); }

    // utility operations
    wchar_t& operator[](size_t i)
      { this->widen(); this->flags &= ~narrow_ok; return this->wide[i]; }
    wchar_t const& operator[](size_t i) const
      { this->widen(); return this->wide[i]; }
    bool equal(Unicode_string const& s) const;
    bool less(Unicode_string const& s) const;

  private:
    void widen() const    { if (!(this->flags & wide_ok)) this->make_wide(); }
    void narrowen() const { if (!(this->flags & narrow_ok)) this->make_narrow(); }
    void make_wide() const;   // do UTF-8 to UCS4 conversion, set wide_ok
    void make_narrow() const; // do UCS4 to UTF-8 conversion, set narrow_ok

    // data members; mutable because each form is a cache of the other
    mutable std::string narrow;
    mutable std::wstring wide;
    enum { neither_ok = 0, narrow_ok = 0x1, wide_ok = 0x2, both_ok = 0x3 };
    mutable char flags; 
  };

  inline bool 
  operator==(Unicode_string const& a, Unicode_string const& b)
    { return a.equal(b); }

  bool 
  Unicode_string::equal(Unicode_string const& s) const
  {
    switch (this->flags & s.flags)
      {
    case narrow_ok: 
      return this->narrow == s.narrow; 
    case wide_ok:
    case both_ok:
      return this->wide == s.wide;
    default: // neither_ok: no valid form in common, so compare as wide
      this->widen();
      s.widen();
      return this->wide == s.wide;
      }
  }

The advantage of this approach is that when Pango implements 
wide-character interfaces (as it had better, someday soon) you don't 
have to change your code much.  If you never operate on the string's 
individual characters, or otherwise treat it as a wide string, the 
conversion never happens.  Likewise, if you construct it as a wide 
string and never treat it as bytes, that conversion never happens.
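
The two conversions that make_wide() and make_narrow() stand for might
look something like the following.  This is only a sketch: the free
functions utf8_to_wide and wide_to_utf8 are hypothetical names, the
input is assumed to be well-formed, and wchar_t is assumed to be 32
bits so that each element holds one whole character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for make_wide(): UTF-8 bytes to one code
// point per wchar_t.  Assumes well-formed input, 32-bit wchar_t.
std::wstring utf8_to_wide(std::string const& in)
{
  std::wstring out;
  std::size_t i = 0;
  while (i < in.size())
    {
      unsigned char lead = static_cast<unsigned char>(in[i++]);
      unsigned long cp;
      int extra;  // continuation bytes expected after the lead byte
      if      (lead < 0x80) { cp = lead;        extra = 0; }
      else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
      else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
      else                  { cp = lead & 0x07; extra = 3; }
      while (extra-- > 0 && i < in.size())
        cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
      out += static_cast<wchar_t>(cp);
    }
  return out;
}

// Hypothetical stand-in for make_narrow(): code points to UTF-8.
std::string wide_to_utf8(std::wstring const& in)
{
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i)
    {
      unsigned long cp = static_cast<unsigned long>(in[i]);
      assert(cp <= 0x10FFFF);  // must lie within the Unicode range
      if (cp < 0x80)
        out += static_cast<char>(cp);
      else if (cp < 0x800)
        {
          out += static_cast<char>(0xC0 | (cp >> 6));
          out += static_cast<char>(0x80 | (cp & 0x3F));
        }
      else if (cp < 0x10000)
        {
          out += static_cast<char>(0xE0 | (cp >> 12));
          out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
          out += static_cast<char>(0x80 | (cp & 0x3F));
        }
      else
        {
          out += static_cast<char>(0xF0 | (cp >> 18));
          out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
          out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
          out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
  return out;
}
```

Both passes are linear in the length of the string, which is why it
pays to run them lazily and at most once per change of representation.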

Nathan Myers
ncm at cantrip dot org




