Re: G_UTF8String: Boxed Type Proposal
- From: Randall Sawyer <srandallsawyer hushmail me>
- To: Matthias Clasen <matthias clasen gmail com>, gtk-devel-list <gtk-devel-list gnome org>
- Subject: Re: G_UTF8String: Boxed Type Proposal
- Date: Fri, 18 Mar 2016 09:57:49 -0400
On 03/17/2016 07:23 PM, Matthias Clasen wrote:
> Sure, code point works too. Anyway, enough with the ontology, we're
> not a standards body....
> I still don't think that we need a utf8-string datatype.
I have questions, then.
Here are excerpts from the current master files:
"gstring.h"
...
struct _GString
{
  gchar *str;
  gsize len;
  gsize allocated_len;
};
...
"gstring.c"
...
/**
 * g_string_insert_len:
 * @string: a #GString
 * @pos: position in @string where insertion should
 *       happen, or -1 for at the end
 * @val: bytes to insert
 * @len: number of bytes of @val to insert
 *
 * Inserts @len bytes of @val into @string at @pos.
 * Because @len is provided, @val may contain embedded
 * nuls and need not be nul-terminated. If @pos is -1,
 * bytes are inserted at the end of the string.
 *
 * Since this function does not stop at nul bytes, it is
 * the caller's responsibility to ensure that @val has at
 * least @len addressable bytes.
 *
 * Returns: (transfer none): @string
 */
GString *
g_string_insert_len (GString     *string,
                     gssize       pos,
                     const gchar *val,
                     gssize       len)
...
/**
 * g_string_insert_unichar:
 * @string: a #GString
 * @pos: the position at which to insert character, or -1
 *       to append at the end of the string
 * @wc: a Unicode character
 *
 * Converts a Unicode character into UTF-8, and insert it
 * into the string at the given position.
 *
 * Returns: (transfer none): @string
 */
GString *
g_string_insert_unichar (GString *string,
                         gssize   pos,
                         gunichar wc)
...
1) Since GString handles insertion of both raw strings and gunichar
values, it seems safe to assume that the raw strings are treated as
UTF-8. In that case, does the argument `pos' refer to a C array index
(a byte offset) or to a UTF-8 character offset? [I had to read the
source code to find out.]
2) Since it is the former, the developer needs to call g_utf8_strlen()
to determine whether there are multi-byte sequences to navigate and, if
there are, g_utf8_offset_to_pointer() to locate the array index. Doesn't
this mean extra passes over the string for every insertion?
3) Wouldn't it be helpful to keep track of how many code points
("characters") are stored in the GString - a number which may be less
than the value of GString.len - without needing to call g_utf8_strlen()
each time to find out?
4) Would my efforts be better spent preparing patches for "gstring.h"
and "gstring.c", or continuing as I am and introducing a parallel
alternative?
If the answer to (4) is to patch GString, then how about the following
modifications?
Change "gstring.h":
...
struct _GString
{
  gchar *str;
  gsize len;
  gsize allocated_len;
  gsize utf8_len;
};
...
Add to "gstring.h":
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_truncate_utf8 (GString *string,
                                 gsize    utf8_len);
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_len_utf8 (GString     *string,
                                   gssize       offset,
                                   const gchar *val,
                                   gssize       utf8_len);
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_utf8 (GString     *string,
                               gssize       offset,
                               const gchar *val);
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_c_utf8 (GString *string,
                                 gssize   offset,
                                 gchar    c);
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_unichar_utf8 (GString *string,
                                       gssize   offset,
                                       gunichar wc);
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_overwrite_utf8 (GString     *string,
                                  gssize       offset,
                                  const gchar *val);
GLIB_AVAILABLE_IN_2_XX
GString* g_string_overwrite_len_utf8 (GString     *string,
                                      gssize       offset,
                                      const gchar *val,
                                      gssize       utf8_len);
Add to "gutf8.c":
...
GLIB_AVAILABLE_IN_2_XX
void g_utf8_measure (const gchar *utf8,
                     glong        max_len,
                     gsize       *utf8_len,
                     gsize       *byte_len,
                     gboolean     validate);
GLIB_AVAILABLE_IN_2_XX
gchar* g_utf8_sized_offset_to_pointer (const gchar *utf8,
                                       glong        offset,
                                       gsize        utf8_len,
                                       gsize        byte_len);
...
Note 1: The GString functions ending in *_utf8 would check whether
GString.len and GString.utf8_len are equal and, if they are, index the
contained gchar array directly, thus dispensing with the lookup of a
pointer from an offset.
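For the comparison in Note 1 to be trustworthy, utf8_len would have to
be updated on every mutation. Below is a minimal sketch of that
bookkeeping in plain C, without GLib. The type Utf8StringSketch and both
functions are hypothetical names invented for illustration, and the
input is assumed to be valid, nul-terminated UTF-8.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the proposed struct; not GLib's GString. */
typedef struct
{
  char  *str;
  size_t len;            /* length in bytes, excluding the nul */
  size_t allocated_len;  /* bytes allocated for str */
  size_t utf8_len;       /* length in code points */
} Utf8StringSketch;

/* Appends val and keeps both lengths in sync.
 * Assumes val is valid UTF-8. Returns 0 on success, -1 on
 * allocation failure. */
static int
utf8_string_sketch_append (Utf8StringSketch *s, const char *val)
{
  size_t n = strlen (val);
  size_t i;

  if (s->len + n + 1 > s->allocated_len)
    {
      size_t new_alloc = (s->len + n + 1) * 2;
      char *tmp = realloc (s->str, new_alloc);

      if (tmp == NULL)
        return -1;
      s->str = tmp;
      s->allocated_len = new_alloc;
    }

  /* Copy the bytes together with the terminating nul. */
  memcpy (s->str + s->len, val, n + 1);
  s->len += n;

  /* Every byte that is not a continuation byte (10xxxxxx)
   * starts a new code point. */
  for (i = 0; i < n; i++)
    if (((unsigned char) val[i] & 0xC0) != 0x80)
      s->utf8_len++;

  return 0;
}

/* The fast-path test from Note 1: when both lengths agree, every
 * code point occupies one byte and offsets equal byte indices. */
static int
utf8_string_sketch_is_single_byte (const Utf8StringSketch *s)
{
  return s->len == s->utf8_len;
}
```

A zero-initialized Utf8StringSketch is a valid empty string here, and
the caller releases str with free() when done.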
Note 2: The function g_utf8_measure() iterates over the passed array
once, arriving at both the value g_utf8_strlen() would return and the
value strlen() would return, rather than iterating over the array twice
as the current API requires. If `validate' is TRUE, a private validating
function is called. If `utf8' is known to be valid, the caller passes
FALSE for `validate', in which case a faster private "skipping" function
is called.
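To make the single pass concrete, here is a rough sketch of what the
non-validating ("skipping") path might look like in plain C. The name
utf8_measure_sketch is hypothetical; the `max_len' and `validate'
parameters of the proposed g_utf8_measure() are left out, and the input
is assumed to be valid, nul-terminated UTF-8.

```c
#include <stddef.h>

/* Hypothetical sketch; not GLib code.
 * Walks the array once and reports both the number of code points
 * and the number of bytes (excluding the terminating nul). */
static void
utf8_measure_sketch (const char *utf8,
                     size_t     *utf8_len,
                     size_t     *byte_len)
{
  size_t chars = 0;
  size_t bytes = 0;

  for (; utf8[bytes] != '\0'; bytes++)
    {
      /* Continuation bytes have the form 10xxxxxx; any other
       * byte starts a new code point. */
      if (((unsigned char) utf8[bytes] & 0xC0) != 0x80)
        chars++;
    }

  *utf8_len = chars;
  *byte_len = bytes;
}
```

For a string such as "h\xC3\xA9llo" this would report 5 code points and
6 bytes from one loop over the data.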
Note 3: The function g_utf8_sized_offset_to_pointer() first compares
`utf8_len' and `byte_len'. If they are equal, it falls back to simple
pointer arithmetic. If they are not, it compares `offset' with
`utf8_len' to decide whether to call g_utf8_offset_to_pointer() from the
beginning or from the end of the array.
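The decision logic in Note 3 could be sketched roughly as follows in
plain C. The name utf8_sized_offset_to_pointer_sketch is hypothetical,
the walk is done by hand instead of by g_utf8_offset_to_pointer(), and
`offset' is simplified to an unsigned count (the proposal uses glong).
Choosing the midpoint as the forward/backward threshold is also an
assumption. The input is assumed to be valid UTF-8 whose lengths were
measured correctly beforehand.

```c
#include <stddef.h>

/* Hypothetical sketch; not GLib code.
 * Returns a pointer to the code point at `offset', given the
 * already known code point count and byte count. */
static const char *
utf8_sized_offset_to_pointer_sketch (const char *utf8,
                                     size_t      offset,
                                     size_t      utf8_len,
                                     size_t      byte_len)
{
  const char *p;
  size_t i;

  /* Clamp to the end of the string. */
  if (offset > utf8_len)
    offset = utf8_len;

  /* Every code point is one byte: plain pointer arithmetic. */
  if (utf8_len == byte_len)
    return utf8 + offset;

  if (offset <= utf8_len / 2)
    {
      /* Closer to the start: walk forward. */
      p = utf8;
      for (i = 0; i < offset; i++)
        {
          p++;
          while (((unsigned char) *p & 0xC0) == 0x80)
            p++;
        }
      return p;
    }

  /* Closer to the end: walk backward from the terminating nul. */
  p = utf8 + byte_len;
  for (i = utf8_len; i > offset; i--)
    {
      p--;
      while (((unsigned char) *p & 0xC0) == 0x80)
        p--;
    }
  return p;
}
```

The backward walk is what the stored byte_len buys: without it, reaching
the end of the array would itself cost a full pass.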
Thank you, Matthias, for your time and attention.
I am sincere in requesting your advice on how best to proceed.
_______________________________________________
gtk-devel-list mailing list
gtk-devel-list gnome org
https://mail.gnome.org/mailman/listinfo/gtk-devel-list