Re: g_utf8_validate() and NUL characters
- From: Dave Benson <daveb idealab com>
- To: Behdad Esfahbod <behdad behdad org>
- Cc: gtk-devel-list gnome org, Havoc Pennington <hp pobox com>
- Subject: Re: g_utf8_validate() and NUL characters
- Date: Thu, 9 Oct 2008 19:54:26 -0700
I recently ran across this situation.
The simple fact is that NUL (character 0; not to be confused with NULL, which is a pointer)
is nowhere stated to be an invalid Unicode character
in the Unicode spec (g_unichar_validate(0) returns TRUE, btw),
and the UTF-8 spec doesn't prohibit 0; following its wording literally,
Unicode character 0 transforms to a single byte 0.
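To make the "single byte 0" point concrete, here is a minimal sketch of the UTF-8 encoding rule in plain C. The helper name encode_utf8() is hypothetical and not part of GLib; it only illustrates that U+0000 falls in the one-byte range like any other ASCII code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a GLib function): encode one Unicode scalar
 * value as UTF-8 into out[] and return the number of bytes written,
 * or 0 if cp is a surrogate or lies beyond U+10FFFF. */
static size_t encode_utf8(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {
        /* U+0000..U+007F: one byte. U+0000 becomes the single byte 0x00. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not encodable */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* beyond the Unicode range */
}
```

Calling encode_utf8(0, buf) returns 1 and leaves buf[0] equal to 0, which is exactly the byte that a NUL-terminated C string cannot carry as data.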
Nonetheless, I think g_utf8_validate() should be kept as is,
at least for a long time. It is misnamed, but it serves such
a useful purpose that it is widely deployed.
I think it should have been named g_utf8_validate_string()
b/c that's a more accurate name. I think it's fair to
say that strings are NUL-terminated in C (e.g. str* functions
and string literals) but there's no standard saying what a string is,
so who knows.
The simple fact is that MOST strings in structures, param-lists, etc. in C
are NUL-terminated, so you definitely want a function like g_utf8_validate_string()
to ensure that a string doesn't contain NUL in a situation
where NUL actually cannot be used.
It would be nice if a g_utf8_validate_data (const char *str,
could be added... it should follow the UTF-8 spec permitting character 0.
Perhaps g_utf8_validate_string() could be added (identical to current
g_utf8_validate() or maybe removing the size param,
and possibly deprecating that function as confusing).
But replacing it with the new semantics should probably wait a long time.
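A rough sketch of how the two semantics could differ, in plain C with no GLib dependency. The names utf8_validate_data() and utf8_validate_string() and their (pointer, length) signatures are my assumptions for illustration only; they are not existing GLib API, and the actual GLib validator may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "data" validator: checks len bytes against the UTF-8 rules
 * and accepts U+0000 (a 0x00 byte) like any other ASCII character. */
static bool utf8_validate_data(const char *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    size_t i = 0;

    assert(data != NULL || len == 0);

    while (i < len) {
        unsigned char b = p[i];
        unsigned long cp, min;
        size_t n;                       /* continuation bytes expected */

        if (b < 0x80) { i++; continue; }            /* ASCII, NUL included */
        else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
        else return false;              /* stray continuation or bad lead */

        if (len - i <= n)
            return false;               /* sequence truncated by end of data */

        for (size_t k = 1; k <= n; k++) {
            if ((p[i + k] & 0xC0) != 0x80)
                return false;           /* not a continuation byte */
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates, and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += n + 1;
    }
    return true;
}

/* Hypothetical "string" validator: same rules, but additionally rejects
 * any embedded NUL, since the result is meant to be used as a C string. */
static bool utf8_validate_string(const char *data, size_t len)
{
    if (len == 0)
        return true;
    return memchr(data, 0, len) == NULL && utf8_validate_data(data, len);
}
```

With these definitions the three bytes "a\0b" pass utf8_validate_data() but fail utf8_validate_string(), which is the distinction being argued for above.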
This is all rather tangential, I believe, to the
original problem with gedit. It should do its own UTF-8
validation, b/c a text editor likes to handle invalid
UTF-8 specially. UTF-8 is a spec that will not change,
and is about 10 lines of code; you can afford to include your own version.
First off, it should do something smarter
to handle other encodings, i.e. detect Latin-1, obey the locale, etc.
And it could default to markup like <red>HEX</red> for non-UTF8 bytes.
That's a lot different from the handling you want from, say,
a configuration parser.
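For illustration, the editor-style handling could look roughly like the sketch below: valid sequences are copied through and each offending byte is wrapped as <red>HEX</red>. The function names, the fixed output buffer, and the truncate-when-full behaviour are all my assumptions; this is not how gedit is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: length (1..4) of the valid UTF-8 sequence starting
 * at p with avail bytes remaining, or 0 if the bytes there are invalid. */
static size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    unsigned long cp, min;
    size_t n;

    if (avail == 0) return 0;
    if (p[0] < 0x80) return 1;
    if ((p[0] & 0xE0) == 0xC0)      { n = 2; cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { n = 3; cp = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { n = 4; cp = p[0] & 0x07; min = 0x10000; }
    else return 0;

    if (avail < n) return 0;
    for (size_t k = 1; k < n; k++) {
        if ((p[k] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (p[k] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return n;
}

/* Hypothetical helper: copy data into out, replacing every invalid byte
 * with "<red>XX</red>". Output is NUL-terminated and silently truncated
 * if it does not fit in cap bytes. Returns the bytes written, excluding
 * the terminator. */
static size_t mark_invalid_utf8(const char *data, size_t len,
                                char *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)data;
    size_t i = 0, o = 0;

    assert(out != NULL && cap > 0);

    while (i < len) {
        size_t n = utf8_seq_len(p + i, len - i);
        if (n > 0) {
            if (o + n >= cap) break;            /* keep room for terminator */
            memcpy(out + o, data + i, n);
            o += n;
            i += n;
        } else {
            if (o + 13 >= cap) break;           /* "<red>XX</red>" is 13 bytes */
            o += (size_t)snprintf(out + o, cap - o, "<red>%02X</red>",
                                  (unsigned)p[i]);
            i++;
        }
    }
    out[o] = '\0';
    return o;
}
```

A configuration parser, by contrast, would more likely just reject the input at the first invalid byte instead of trying to display it.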