Re: G_UTF8String: Boxed Type Proposal
- From: Randall Sawyer <srandallsawyer hushmail me>
- To: "Jasper St. Pierre" <jstpierre mecheye net>
- Cc: gtk-devel-list <gtk-devel-list gnome org>
- Subject: Re: G_UTF8String: Boxed Type Proposal
- Date: Thu, 17 Mar 2016 15:43:30 -0400
On 03/17/2016 02:26 PM, Jasper St. Pierre wrote:
> I'll also ask what "character" means in this case, even though I know
> glib also has the same confusion. Are you talking about the number of
> Unicode code points in the string, or the number of grapheme clusters,
> as defined by Unicode TR29 [0]? The number of code points isn't useful
> for editing in all cases, even after NFC normalization. Some grapheme
> clusters just don't have a single code-point representation.
>
> [0] http://unicode.org/reports/tr29/
Good question. Thank you, Jasper.
I just took a look at TR29. The examples in Table 1a, "Sample Grapheme
Clusters" [1], immediately illustrate for me how multiple code points
may be combined into a distinct grapheme ("character"?).
As I delve into Unicode, a three-level hierarchy of eight-bit strings is
emerging in my mind:
Bytes [Low level]: Strings of binary octets - typically terminated by
the null byte 0x00. The number of bytes defines the "length" of the
string. This is the level currently served well by glib's GString structure.
Code Points [Middle level]: Sequences of 1 to 4 bytes (up to 6 in the
original UTF-8 design, before RFC 3629 restricted it) - each either
undefined or serving as a packet to deliver a unique code point. The
number of code points defines the "length" of the string. This is the
level at which I am proposing that "G_UTF8String" - or something like
it - will serve developers well.
Graphemes [High level]: Sequences of one or more code points - each
serving as a packet to deliver a unique grapheme. In this case, the
number of graphemes defines the "length" of the string. This level
can be best served with a strong middle level supporting it.
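
To make the difference between the low and middle levels concrete, here
is a minimal sketch in plain C. The helper name utf8_code_point_count is
hypothetical and is not part of glib or of the proposal (glib's existing
g_utf8_strlen() does comparable counting). It assumes the input is valid,
NUL-terminated UTF-8 and simply skips continuation bytes, which always
have the bit pattern 10xxxxxx.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only: count the code points in
 * a NUL-terminated string. Assumes the input is valid UTF-8, so every
 * byte that is not a continuation byte (10xxxxxx) starts a new code
 * point. No validation is performed. */
static size_t utf8_code_point_count(const char *s)
{
    size_t count = 0;
    const unsigned char *p;

    for (p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "e\xCC\x81" (the letter e followed by U+0301 COMBINING
ACUTE ACCENT), strlen() reports 3 bytes and the helper reports 2 code
points, while a reader sees a single grapheme. Counting at the high
level would additionally need the TR29 segmentation rules, which this
sketch does not attempt.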
I am developing structures and methods to "Manage Strings of UTF-8
Encoded Unicode Code Points": the middle level. Henceforth, I will
refine my terminology - dropping entirely the term "character" as used
in glib et al. documentation - and adopting "utf8 code point" in its
place.
[Speaking as a North American, it is easy to slip into lazy provincial
thought and to miss these distinctions. It might serve us all better if
programming languages with a "char" type were to rename it "byte".
Likewise, instead of "gchar" and "guchar", glib might adopt "gbyte" and
"gubyte".]
[1] http://unicode.org/reports/tr29/#Table_Sample_Grapheme_Clusters