Re: G_UTF8String: Boxed Type Proposal

I'll also ask what "character" means here, even though I know GLib has
the same ambiguity. Do you mean the number of Unicode code points in the
string, or the number of grapheme clusters as defined by Unicode TR29 [0]?
The code point count isn't always useful for editing, even after NFC
normalization: some grapheme clusters simply have no single-code-point
representation.
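
To make the distinction concrete, here is a minimal sketch in plain C (an
illustration only, not GLib code). The names count_code_points and
demo_counts are made up for this example, and the input is assumed to be
valid UTF-8 already. The third sample, "x" followed by U+0304 COMBINING
MACRON, is assumed to have no precomposed form, so NFC leaves it as two
code points even though a user sees one character.

```c
#include <assert.h>
#include <stddef.h>

/* Count Unicode code points in a NUL-terminated string.
 * Assumption: the input is already valid UTF-8, so every byte that is
 * not a continuation byte (10xxxxxx) starts a new code point. */
size_t count_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical demo: three strings that each display as one character. */
void demo_counts(void)
{
    /* U+00E9, precomposed "e with acute": 2 bytes, 1 code point. */
    assert(count_code_points("\xC3\xA9") == 1);

    /* U+0065 U+0301, "e" + combining acute: 3 bytes, 2 code points.
     * NFC normalization would fold this into U+00E9. */
    assert(count_code_points("e\xCC\x81") == 2);

    /* U+0078 U+0304, "x" + combining macron: 3 bytes, 2 code points.
     * Assumed to have no precomposed form, so NFC cannot shorten it. */
    assert(count_code_points("x\xCC\x84") == 2);
}
```

All three strings are one grapheme cluster each, but a code point count
reports 1, 2 and 2. Getting the cluster count would require the TR29
segmentation rules, which this sketch does not attempt.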


On Thu, Mar 17, 2016 at 11:18 AM, Randall Sawyer
<srandallsawyer hushmail me> wrote:
On 03/17/2016 10:39 AM, Randall Sawyer wrote:

On 03/17/2016 09:30 AM, Matthias Clasen wrote:

I believe that you haven't found such a proposal because most people
don't see much use in a separate boxed type for utf8 strings. Every
string we pass around in GLib and GTK+, and every char * in their APIs
is expected to be in utf8. The few exceptions to this rule are
explicitly documented.

GLib already provides a number of utilities for dealing with utf8
strings in terms of characters, such as g_utf8_strlen,
g_utf8_substring, g_utf8_find_next/prev_char. We can certainly discuss
adding to that list, if there are glaring omissions.

Here is the vision: Once raw string data, or a gunichar value, has been
passed into and validated by the constructor of a "G_UTF8String" structure,
the contents of two or more of these can be easily combined without the
need for additional measuring or validating.
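
One way to picture this idea is the plain C sketch below. It is an
illustration under stated assumptions, not the proposed G_UTF8String and
not GLib API: the type Utf8Box and the functions utf8_scan, utf8_box_new,
utf8_box_concat and utf8_box_free are hypothetical names. The constructor
validates against RFC 3629 and counts code points in one pass; the
concatenation then only adds the stored lengths.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical boxed type, not part of GLib. */
typedef struct {
    char  *bytes;    /* NUL-terminated, validated UTF-8 */
    size_t n_bytes;  /* length in bytes, excluding the NUL */
    size_t n_chars;  /* length in code points */
} Utf8Box;

/* Validate per RFC 3629 and count code points in a single pass. */
static bool utf8_scan(const unsigned char *s, size_t len, size_t *n_chars)
{
    size_t i = 0, count = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t need;

        if (c < 0x80)                { need = 0; cp = c;        min = 0; }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return false;                     /* invalid lead byte */

        if (need > len - i - 1) return false;  /* truncated sequence */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF) return false;      /* overlong or out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   /* surrogate */
        i += need + 1;
        count++;
    }
    *n_chars = count;
    return true;
}

/* Validate once at construction; returns NULL on invalid input or OOM. */
Utf8Box *utf8_box_new(const char *raw)
{
    size_t len = strlen(raw), chars;
    Utf8Box *b;

    if (!utf8_scan((const unsigned char *)raw, len, &chars)) return NULL;
    if ((b = malloc(sizeof *b)) == NULL) return NULL;
    if ((b->bytes = malloc(len + 1)) == NULL) { free(b); return NULL; }
    memcpy(b->bytes, raw, len + 1);
    b->n_bytes = len;
    b->n_chars = chars;
    return b;
}

/* Combine two boxes without rescanning: both lengths are simply added. */
Utf8Box *utf8_box_concat(const Utf8Box *a, const Utf8Box *b)
{
    Utf8Box *r = malloc(sizeof *r);

    if (r == NULL) return NULL;
    r->n_bytes = a->n_bytes + b->n_bytes;
    r->n_chars = a->n_chars + b->n_chars;
    if ((r->bytes = malloc(r->n_bytes + 1)) == NULL) { free(r); return NULL; }
    memcpy(r->bytes, a->bytes, a->n_bytes);
    memcpy(r->bytes + a->n_bytes, b->bytes, b->n_bytes + 1);
    return r;
}

void utf8_box_free(Utf8Box *b)
{
    if (b != NULL) { free(b->bytes); free(b); }
}
```

Skipping the rescan is safe here because joining two valid UTF-8 strings
always gives valid UTF-8, and code point counts add up. The same shortcut
would not hold for a grapheme cluster count, since a combining mark at the
start of the second string attaches to the last character of the first.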

Alright Matthias, after your thoughtful response, I have come to the
following conclusion: when managing dynamically allocated UTF-8 strings,
there are actually two points to consider: 1) whether the byte sequences
are valid per IETF RFC 3629 Section 4, and 2) the number of distinct
characters represented in the string vs. the total number of bytes used
to represent them.

If someone were to write a widget library or an application using libraries
which ensure valid UTF-8 as input - Gdk key-press events and GtkIMContexts
for example - then it wouldn't make sense to run those strings through yet
another course of validation. That addresses the first issue.

There is still the question of character length vs. byte length.

Therefore - from what you have told me - I will be sure to present methods
which feature validation as an option and not as the rule.

Thank you.

gtk-devel-list mailing list
gtk-devel-list gnome org

