Re: unicode string



On Thu, Jul 06, 2000 at 03:25:43AM -0400, Havoc Pennington wrote:
> 
> I'll try putting together an implementation of utf8_string, and send
> it along to see what you think. I'd appreciate any comments.
> 
> One question about the sketch you sent: your conversion from const
> char* is explicit, unlike std::string. Is there a rationale for that?

I don't think you would ever want this thing to be constructed
invisibly.  The point is debatable for regular strings, but for
this one it seems pretty clear.  Invisible conversions are a 
real hazard in the best of circumstances, and rarely justified.  
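
To make the hazard concrete, here is a minimal sketch. The names implicit_text, explicit_text, count_implicit, and count_explicit are hypothetical stand-ins, not part of the class below. With a converting constructor, a string literal quietly becomes a temporary object at the call site; with an explicit one, the caller has to spell the construction out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical stand-ins for illustration only.
struct implicit_text {
    implicit_text(char const* p) : s(p) {}           // converting constructor
    std::string s;
};

struct explicit_text {
    explicit explicit_text(char const* p) : s(p) {}  // must be named at the call site
    std::string s;
};

inline std::size_t count_implicit(implicit_text const& t) { return t.s.size(); }
inline std::size_t count_explicit(explicit_text const& t) { return t.s.size(); }

// The compiler will build an implicit_text from a char const* on its own,
// but never an explicit_text.
static_assert(std::is_convertible<char const*, implicit_text>::value,
              "implicit_text converts invisibly");
static_assert(!std::is_convertible<char const*, explicit_text>::value,
              "explicit_text does not");

inline void demo_conversions()
{
    count_implicit("abc");                   // compiles: hidden temporary
    // count_explicit("abc");                // would not compile
    count_explicit(explicit_text("abc"));    // the construction is visible
}
```

For a class that may also run a UTF-8 to UCS4 conversion behind the scenes, the second form at least shows the reader where the object comes from.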

The operator conversions are sort-of-OK here because there really 
is (e.g.) a wstring in there, but even then there's the lifetime 
issue.  References into other objects are always perilous.
I'm still debating whether the operator conversions really should
just be ordinary named member functions.  If this is a temporary
conversion hack, maybe not.  But if it's a permanent part of the 
library, then probably so.  (Is anything ever really as temporary 
as it should be?)
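
A small sketch of the lifetime issue, using a hypothetical holder class in place of the real one: both the conversion operator and the named member hand out a reference into the object, so the reference is only good while the object lives. The named member does not remove the problem, but it makes the borrowing visible in the source.

```cpp
#include <cassert>
#include <string>

// Hypothetical holder, standing in for the class sketched below.
class holder {
public:
    explicit holder(char const* p) : m_text(p) {}
    operator std::string const&() const { return m_text; }  // invisible conversion
    std::string const& str() const { return m_text; }       // named equivalent
private:
    std::string m_text;
};

inline holder make_holder() { return holder("temporary"); }

inline std::string safe_copy()
{
    // Copying while the temporary is still alive is fine.
    std::string copy = make_holder().str();
    return copy;
}

// The hazard, left commented out because using r would be undefined behavior:
//
//   std::string const& r = make_holder();  // binds to m_text inside a temporary
//   // The temporary holder is destroyed at the end of that statement;
//   // its lifetime is not extended, because r is bound to the result of
//   // the conversion function, not to the temporary itself.
```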

Note that in any case no function should take one of these monsters 
as an argument.

Nathan Myers
ncm at cantrip dot org

---------

  class utf8_string
  {
  public:
    // constructors
    explicit utf8_string(char const* p)
      : m_narrow(p), m_wide(), m_flags(utf8_string::narrow_ok) {}
    explicit utf8_string(std::string const& s)
      : m_narrow(s), m_wide(), m_flags(utf8_string::narrow_ok) {}
    explicit utf8_string(std::wstring const& s)
      : m_narrow(), m_wide(s), m_flags(utf8_string::wide_ok) {}

    // conversions
    operator std::wstring const&() const
      { this->widen(); return this->m_wide; }
    operator std::string const&() const
      { this->narrowen(); return this->m_narrow; }
    char const* c_str() const
      { this->narrowen(); return this->m_narrow.c_str(); }

    // utility operations
    wchar_t& operator[](size_t i);
    wchar_t const& operator[](size_t i) const
      { this->widen(); return this->m_wide[i]; }
    bool equal(utf8_string const& s) const;
    bool less(utf8_string const& s) const;

  private:
    void widen() const
      { if (!(this->m_flags & wide_ok)) this->make_wide(); }
    void narrowen() const
      { if (!(this->m_flags & narrow_ok)) this->make_narrow(); }
    void make_wide() const;   // do UTF-8 to UCS4 conversion
    void make_narrow() const; // do UCS4 to UTF-8 conversion

    // data members
    mutable std::string m_narrow;
    mutable std::wstring m_wide;
    enum { neither_ok = 0, narrow_ok = 0x1, wide_ok = 0x2, both_ok = 0x3 };
    mutable char m_flags; 
  };

  inline bool 
  operator==(utf8_string const& a, utf8_string const& b)
    { return a.equal(b); }

// utf8_string.cc:

  bool 
  utf8_string::equal(utf8_string const& s) const
  {
    // The AND of the flags says which representation both sides already have.
    switch (this->m_flags & s.m_flags)
      {
    case utf8_string::narrow_ok:
        return this->m_narrow == s.m_narrow;
    case utf8_string::neither_ok:
        // no representation in common: convert both to wide
        this->widen(); s.widen();
        // fall through
    case utf8_string::wide_ok:
    case utf8_string::both_ok:
    default:
        return this->m_wide == s.m_wide;
      }
  }

  wchar_t& 
  utf8_string::operator[](size_t i)  // non-const
  {
    this->widen(); 
    this->m_flags &= ~utf8_string::narrow_ok; 
    return this->m_wide[i]; 
  }
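
The sketch above declares make_wide() but never defines it. As a rough illustration of the UTF-8 to UCS4 step it refers to, here is a minimal decoder written as a free function. The name utf8_to_ucs4 is hypothetical, and it returns a C++11 std::u32string (an assumption, so that the example does not depend on the width of wchar_t). It checks lead and continuation bytes and truncation, but does not reject overlong encodings or surrogate values, which a real implementation would need to do.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UCS4 code points.
inline std::u32string utf8_to_ucs4(std::string const& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("bad UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("bad UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append six payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

In the class itself, make_wide() would presumably store a result like this in m_wide and then set wide_ok in m_flags.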



