Re: G_UTF8String: Boxed Type Proposal

From: Randall Sawyer <srandallsawyer hushmail me>
To: "Jasper St. Pierre" <jstpierre mecheye net>
Cc: gtk-devel-list <gtk-devel-list gnome org>
Subject: Re: G_UTF8String: Boxed Type Proposal
Date: Thu, 17 Mar 2016 15:43:30 -0400

On 03/17/2016 02:26 PM, Jasper St. Pierre wrote:

I'll also ask what "character" means in this case, even though I know
glib also has the same confusion. Are you talking about the number of
Unicode code points in the string, or the number of grapheme clusters,
as defined by Unicode TR29 [0]? The number of code points isn't useful
for editing in all cases, even after NFC normalization. Some grapheme
clusters just don't have a single code-point representation.

[0] http://unicode.org/reports/tr29/


Good question. Thank you, Jasper.

I just took a look at TR29. The examples in Table 1a, Sample Grapheme Clusters [1], are to me immediately illustrative of how multiple code points may be combined into a distinct grapheme ("character"?).

As I delve into Unicode, a hierarchy of eight-bit strings is emerging in my mind:

Bytes [Low level]: Strings of binary octets - typically terminated by the null byte 0x00. The number of bytes defines the "length" of the string. This is the level currently served well by glib's GString structure.

Code Points [Middle level]: Sequences of 1 to 6 bytes - each either undefined or serving as a packet to deliver a unique code point. The number of code points defines the "length" of the string. This is the level at which I am proposing that "G_UTF8String" - or something like it - will serve developers well.

Graphemes [High level]: Sequences of one or more code points - each serving as a packet to deliver a unique grapheme. In this case, the number of graphemes defines the "length" of the string. This level can be best served with a strong middle level supporting it.
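
To make the three levels concrete, here is a minimal sketch in plain C that compares the byte length and the code point length of one string. The helper name utf8_code_point_count is hypothetical and is not part of glib (glib's own g_utf8_strlen covers similar ground). It assumes valid, NUL-terminated UTF-8, and it makes no attempt to count graphemes, which would need the TR29 segmentation rules.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a glib function.
 * Counts UTF-8 code points by counting every byte that is NOT a
 * continuation byte (continuation bytes have the form 10xxxxxx).
 * Assumes the input is valid, NUL-terminated UTF-8. */
static size_t utf8_code_point_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For "g" followed by U+0308 COMBINING DIAERESIS (written in C as "g\xCC\x88"), strlen() reports 3 bytes and the helper reports 2 code points, while a reader sees a single grapheme. That pair has no precomposed form, so NFC normalization does not reduce it to one code point.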

I am developing structures and methods to "Manage Strings of UTF-8 Encoded Unicode Code Points". Middle level. Henceforth, I will refine my terminology - dropping entirely the term "character" as used in glib et al. documentation - and adopting "utf8 code point" in its place.
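
As a rough illustration of what a middle-level structure might look like, the sketch below keeps both the byte length and the code point length alongside the data and looks up a code point by index. Everything here is an assumption made for illustration: Utf8View, utf8_view_init and utf8_view_at are hypothetical names, not the proposed G_UTF8String layout and not glib API (glib's g_utf8_offset_to_pointer does a comparable lookup on a plain string). Input is assumed to be valid, NUL-terminated UTF-8.

```c
#include <stddef.h>

/* Hypothetical illustration only; not the proposed G_UTF8String. */
typedef struct {
    const char *bytes;         /* NUL-terminated UTF-8 data (not owned) */
    size_t      n_bytes;       /* low-level length, excluding the NUL */
    size_t      n_code_points; /* middle-level length */
} Utf8View;

/* Scan the string once and record both lengths. */
static Utf8View utf8_view_init(const char *s)
{
    Utf8View v = { s, 0, 0 };
    for (; s[v.n_bytes] != '\0'; v.n_bytes++) {
        /* Every byte that is not 10xxxxxx starts a new code point. */
        if (((unsigned char)s[v.n_bytes] & 0xC0) != 0x80)
            v.n_code_points++;
    }
    return v;
}

/* Return a pointer to the first byte of the index-th code point
 * (zero-based), or NULL if index is out of range. */
static const char *utf8_view_at(const Utf8View *v, size_t index)
{
    const char *p;

    if (index >= v->n_code_points)
        return NULL;
    for (p = v->bytes; ; p++) {
        if (((unsigned char)*p & 0xC0) != 0x80) {
            if (index == 0)
                return p;
            index--;
        }
    }
}
```

For the string "a", U+20AC EURO SIGN, "b" (five bytes in total), the view records 5 bytes and 3 code points, and index 2 resolves to the byte at offset 4. The lookup is a linear scan. A real implementation might cache offsets, but that is beyond this sketch.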

[Geographically speaking as a North American, it is easy to slip into lazy provincial thought and to miss these distinctions. It might serve us all better if programming languages with a "char" type were to rename it "byte". Likewise, instead of "gchar" and "guchar", glib may adopt "gbyte" and "gubyte".]


[1] http://unicode.org/reports/tr29/#Table_Sample_Grapheme_Clusters

