Re: G_UTF8String: Boxed Type Proposal

On 03/17/2016 02:26 PM, Jasper St. Pierre wrote:
I'll also ask what "character" means in this case, even though I know
glib also has the same confusion. Are you talking about the number of
Unicode code points in the string, or the number of grapheme clusters,
as defined by Unicode TR29 [0]? The number of code points isn't useful
for editing in all cases, even after NFC normalization. Some grapheme
clusters just don't have a single code-point representation.


Good question. Thank you, Jasper.

I just took a look at TR29. The examples in Table 1a, "Sample Grapheme Clusters" [1], show me immediately how multiple code points may be combined into a single grapheme ("character"?).

As I delve into Unicode, a three-level hierarchy for strings of eight-bit units is emerging in my mind:

Bytes [Low level]: Strings of binary octets - typically terminated by the null byte 0x00. The number of bytes defines the "length" of the string. This is the level currently served well by glib's GString structure.

Code Points [Middle level]: Sequences of 1 to 4 bytes (up to 6 in the original UTF-8 design, before RFC 3629 restricted it) - each either invalid or serving as a packet to deliver a unique code point. The number of code points defines the "length" of the string. This is the level at which I am proposing that "G_UTF8String" - or something like it - will serve developers well.

Graphemes [High level]: Sequences of one or more code points - each serving as a packet to deliver a unique grapheme. In this case, the number of graphemes defines the "length" of the string. This level is best served by a strong middle level supporting it.
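
To make the three "lengths" concrete, here is a minimal sketch in plain C (no glib). All function names are made up for illustration. It assumes well-formed UTF-8 input, and the grapheme count is a deliberate simplification: it only treats the combining diacritical marks U+0300 to U+036F as extending the previous cluster, which is far short of the full TR29 rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Low level: length in bytes, up to the terminating 0x00. */
size_t utf8_byte_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Middle level: length in code points. In well-formed UTF-8 every code
 * point contains exactly one byte that is not a continuation byte
 * (10xxxxxx), so counting those bytes counts code points. */
size_t utf8_code_point_length(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++) {
        if ((*p & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Decode one code point and advance *pp past it. Assumes **pp != 0 and
 * well-formed input; it never reads past the terminating 0x00. */
static uint32_t utf8_next(const unsigned char **pp)
{
    const unsigned char *p = *pp;
    uint32_t cp;
    int extra;

    if (*p < 0x80)                { cp = *p;        extra = 0; }
    else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
    else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
    else                          { cp = *p & 0x07; extra = 3; }
    p++;
    while (extra-- > 0 && (*p & 0xC0) == 0x80) {
        cp = (cp << 6) | (uint32_t)(*p & 0x3F);
        p++;
    }
    *pp = p;
    return cp;
}

/* High level, SIMPLIFIED (not TR29): a code point in U+0300..U+036F
 * joins the preceding cluster; every other code point starts a new one.
 * A leading combining mark still counts as one cluster. */
size_t utf8_grapheme_length_simplified(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    assert(s != NULL);
    while (*p != 0) {
        uint32_t cp = utf8_next(&p);
        int combining = (cp >= 0x0300 && cp <= 0x036F);
        if (n == 0 || !combining)
            n++;
    }
    return n;
}
```

For the string "cafe" followed by U+0301 (COMBINING ACUTE ACCENT), written in C as "cafe\xCC\x81", the three functions return 6, 5 and 4 respectively: six bytes, five code points, four graphemes.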

I am developing structures and methods to "Manage Strings of UTF-8 Encoded Unicode Code Points". That is the middle level. Henceforth, I will refine my terminology - dropping entirely the term "character" as used in glib et al. documentation - and adopting "utf8 code point" in its place.
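
Purely as an illustration of what a middle-level structure could look like, here is a hypothetical sketch in plain C. The type `Utf8Str` and its functions are invented for this example; they are not part of glib and are not the layout of the proposed boxed type. The idea shown is only that such a structure can cache both the byte length and the code point length, so neither has to be recounted on every query. Input is assumed to be well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type for illustration only. */
typedef struct {
    char  *data;      /* NUL-terminated UTF-8 bytes */
    size_t n_bytes;   /* length in bytes, excluding the NUL */
    size_t n_points;  /* length in code points */
} Utf8Str;

/* Count code points by counting bytes that are not continuation bytes. */
static size_t count_code_points(const char *s, size_t n_bytes)
{
    size_t n = 0;
    for (size_t i = 0; i < n_bytes; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Create from a NUL-terminated string. Returns NULL if allocation fails. */
Utf8Str *utf8str_new(const char *init)
{
    assert(init != NULL);
    Utf8Str *u = malloc(sizeof *u);
    if (u == NULL)
        return NULL;
    u->n_bytes = strlen(init);
    u->data = malloc(u->n_bytes + 1);
    if (u->data == NULL) {
        free(u);
        return NULL;
    }
    memcpy(u->data, init, u->n_bytes + 1);
    u->n_points = count_code_points(init, u->n_bytes);
    return u;
}

/* Append tail. Returns 0 on success, -1 if allocation fails, in which
 * case the string is left unchanged. Both cached lengths are updated. */
int utf8str_append(Utf8Str *u, const char *tail)
{
    assert(u != NULL && tail != NULL);
    size_t tail_bytes = strlen(tail);
    char *grown = realloc(u->data, u->n_bytes + tail_bytes + 1);
    if (grown == NULL)
        return -1;
    memcpy(grown + u->n_bytes, tail, tail_bytes + 1);
    u->data = grown;
    u->n_bytes += tail_bytes;
    u->n_points += count_code_points(tail, tail_bytes);
    return 0;
}

void utf8str_free(Utf8Str *u)
{
    if (u != NULL) {
        free(u->data);
        free(u);
    }
}
```

Caching the code point count makes the middle-level "length" an O(1) lookup, at the cost of keeping it in sync on every mutation. Whether that trade-off is right for a real boxed type is an open design question.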

[Geographically speaking, as a North American, it is easy to slip into lazy provincial thought and to miss these distinctions. It might serve us all better if programming languages with a "char" type were to rename it "byte". Likewise, instead of "gchar" and "guchar", glib might adopt "gbyte" and "gubyte".]

gtk-devel-list mailing list
gtk-devel-list gnome org
