Strings and bindings




A little while ago, Moshe Zadka sent me some mail asking
about how strings would be exported to language bindings
in GTK+-1.4. I replied that I thought that GTK_TYPE_STRING
would continue to be sufficient, since strings will
always be 8-bit and UTF-8.

But, with a bit more consideration, I'm not 100% sure
that is the right answer, so I thought I'd send this
mail out.


For GTK+-1.4, there will essentially be two types of strings:

 - Strings for display. These strings are specified to
   be iso-10646 encoded in UTF-8. We'll call these user-strings.

 - Strings not for display. (For instance, the string in:
   gtk_object_set_data(), gtk_signal_emit_by_name() or
   gtk_text_tag_create()) These do not have a specified
   encoding. We'll call these key-strings.

   (Programs would be advised to stick to straight ASCII
   for such keys, but there is no requirement for this.)

The C language mapping (a char *) and the rules for passing
and memory management are the same for both types. However,
it isn't clear that the two types should always map the same
way in all language bindings.


Let's look at some case studies:

Perl (5.6)
==========

Strings are not marked as unicode or not. Instead utf8 processing can
be turned on for a block using the 'use utf8' pragma.

This works very well with GTK+-1.4. No changes in the binding or
applications are necessary.

Applications should be advised to use 'use utf8' whenever processing
strings from GTK+, but that is the only change necessary.


Python (1.6)
============

There are distinct Unicode-string and normal string types.
Conversions can be made both ways. By default the conversions
assume utf-8, but the encoding for normal strings can also
be explicitly declared.

The simplest way of handling things is to say that
the binding considers all normal Python strings to be
encoded in UTF-8. Then the rules for passing strings into
GTK+ are simple:

 - normal strings are passed through unconverted
 - unicode strings are converted from the internal representation
   to UTF-8 before passing to GTK+
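
As a rough sketch of these two rules (not code from any actual
binding): the example below is written in present-day Python 3, where
`bytes` stands in for the "normal string" and `str` for the unicode
string. The helper name `to_gtk_string` is hypothetical.

```python
def to_gtk_string(value):
    """Return the bytes a binding would hand to GTK+ as a char *."""
    if isinstance(value, bytes):
        # Normal string: assumed to be UTF-8 already, passed through unconverted.
        return value
    if isinstance(value, str):
        # Unicode string: converted from the internal representation to UTF-8.
        return value.encode("utf-8")
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)
```

Note that the pass-through branch does no validation; a normal string
that is not really UTF-8 would reach GTK+ unchanged.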

The only question is for returning key-strings 
(something which is very rare in GTK+ currently) - do we

 a) Return them as unicode strings, like user-visible strings
 b) Return them as normal strings

Option b) requires a distinction in the type system between
the two types of GTK+ strings.
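
To make the difference between a) and b) concrete, here is a minimal
sketch in present-day Python 3 (`bytes` as the normal string, `str` as
the unicode string). The `StringKind` tag and the `from_gtk_string`
helper are hypothetical; they stand in for whatever distinction the
type system would have to carry.

```python
from enum import Enum


class StringKind(Enum):
    USER = "user-string"  # for display, specified to be UTF-8
    KEY = "key-string"    # not for display, no specified encoding


def from_gtk_string(raw, kind, distinguish=True):
    """Convert bytes returned by GTK+ into a Python value."""
    if distinguish and kind is StringKind.KEY:
        # Option b): key-strings come back as normal (byte) strings.
        return raw
    # Option a), and every user-string: decode UTF-8 into a unicode string.
    return raw.decode("utf-8")
```

With `distinguish=False` the binding never needs to look at `kind`,
which is option a); the cost is that a key-string that is not valid
UTF-8 would fail to decode.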


Things become more complex if you want to allow for setting
the assumed encoding for normal strings to something other
than utf-8. In that case, you need to do conversions 
when passing in normal strings to GTK+ for user-strings.
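
A minimal sketch of that extra conversion, again in present-day
Python 3 with `bytes` as the normal string. The function name and the
`assumed_encoding` parameter are hypothetical, since no such setting
exists in the plans described here.

```python
import codecs


def user_string_to_gtk(value, assumed_encoding="utf-8"):
    """Return UTF-8 bytes for a user-string argument."""
    if isinstance(value, str):
        # Unicode strings are always encoded to UTF-8.
        return value.encode("utf-8")
    if codecs.lookup(assumed_encoding).name == "utf-8":
        # Normal string already in UTF-8: pass through unconverted.
        return value
    # Normal string in some other encoding: recode it to UTF-8.
    return value.decode(assumed_encoding).encode("utf-8")
```

Key-strings would not go through this path, since they have no
specified encoding to convert to.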

I don't think current plans for Python have the idea of
a "runtime encoding for normal strings", though there
probably will be some provision for specifying the encoding
of scripts during parsing. So, this may not be a necessary
feature.
 

C++
===

The C++ standard does not say anything about encodings. There
are two standard string types - string, and wstring, with
wstring being a sequence of wide characters. 

One problem with wide characters is 16-bit vs. 32-bit characters.
GTK+, because it handles things in UTF-8, has almost no
overhead for allowing 32-bit characters, and therefore 
does so. But a fixed-width wide character encoding that
uses 32-bit characters is quite expensive. 

The Unicode standard currently uses only 16-bit characters,
all common characters for living languages are planned to be
included in the 16-bit space, and many systems do use 16-bit
characters (Windows, Java, Python).

However, there will soon be some character sets defined outside
of the 16-bit "Basic Multilingual Plane", and allowing
32-bit characters is, IMO, nicer than confining oneself to
an almost-full character space.
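
The size trade-off can be illustrated with a small Python 3 sketch.
The helper `encoded_sizes` is hypothetical, and U+10000 is used only
as an example of a code point outside the Basic Multilingual Plane.

```python
def encoded_sizes(text):
    """Return the number of bytes text occupies in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The -le variants are used so that no byte-order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


ascii_sizes = encoded_sizes("hello")
beyond_bmp_sizes = encoded_sizes("\U00010000")
```

For plain ASCII text the fixed-width 32-bit form is four times the
size of UTF-8, while a character outside the 16-bit space costs
UTF-8 four bytes only where it actually occurs.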
 
There are at least three ways I can think of to handle
GTK+'s utf-8 strings in C++:

 - convert them to wstring

 - convert them to basic_string<gunichar> 
 
   that is, avoid the problem of the unspecified width, by
   defining a new string type using a type of specified
   width.

 - Create an STL-string-like wrapper for a utf8 string. The
   problem here is that you don't get O(1) random access, which
   will no doubt disturb some of the people reading this.
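
To show why a utf8 wrapper loses O(1) random access, here is a sketch
of character indexing over UTF-8 bytes, written in Python 3 for
brevity rather than C++. The function `utf8_char_at` is hypothetical
and assumes its input is valid UTF-8.

```python
def utf8_char_at(data, index):
    """Return the index-th character of UTF-8 bytes by scanning from the start."""
    if index < 0:
        raise IndexError("character index out of range")
    count = -1    # number of character starts seen so far, minus one
    start = None  # byte offset where the wanted character begins
    for pos, byte in enumerate(data):
        # Bytes of the form 10xxxxxx continue a character;
        # any other byte starts a new one.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                # The next character begins here, so the wanted one ends.
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("character index out of range")
    # The wanted character was the last one in the string.
    return data[start:].decode("utf-8")
```

The loop has to walk every byte before the requested character, so
the cost grows with the index instead of being constant.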


Whatever the solution on the C++ side, the binding situation
is much like Python.

 - There should be implicit conversions between 8-bit strings
   and unicode strings that assume that the 8-bit strings are
   utf8.

 - For simplicity, all function arguments and results could
   be wide/unicode strings, or for performance, you could distinguish
   between user-strings (mapped to wide/unicode strings) and
   key-strings (mapped to strings).

If one did use the standard STL wstring type, then one would
run into the problem that there will be no 

 wstring (const char *eightbit_string);

constructor so you would probably have to subclass it to add
that converter in any case. But I'm not enough of a C++ expert
to really comment.


So, the question is, do we need two types in the type system for
user-visible and non-user-visible strings or just one? My default
answer is that we should keep it simple, and just have one, but I'm
very willing to accept input on this issue.

Regards,
                                        Owen







