g_utf8_validate() and NUL characters
- From: coda <coda trigger gmail com>
- To: gtk-devel-list gnome org
- Subject: g_utf8_validate() and NUL characters
- Date: Mon, 6 Oct 2008 20:42:22 +0000 (UTC)
I ended up here after pursuing the "invalid character coding" behavior of gedit.
gedit tries to convert a file to UTF-8 using g_convert, which always succeeds
when converting from an 8-bit encoding like ISO-8859-1. The converted string
contents could contain a NUL, since that's the canonical representation of
U+0000 NULL, a valid character. However, gedit must call g_utf8_validate() on
the contents to make sure that GTK+ widgets will accept the string, and
g_utf8_validate() does not consider a NUL character valid. As a result of all
this, gedit's inability to edit "binary files" is simply an inability to edit a
file with a NUL byte in it.
The bug in gedit is here, with a rather poor patch that lets the file be opened
but corrupts it if saved: http://bugzilla.gnome.org/show_bug.cgi?id=156199
I discussed this on #gtk+ with mathrick and pbor and it seems that the
assumption that UTF-8 strings are NUL-terminated and contain no NULs runs pretty
deep. A possible solution is to use "modified UTF-8" (
http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 ), which represents U+0000 as
the two-byte sequence 0xC0 0x80. That sequence is an overlong encoding and is
illegal in standard UTF-8, but the normal decoding algorithm still decodes it
to U+0000, and the encoded string contains no NUL byte.
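To make the modified UTF-8 idea concrete, here is a minimal sketch in plain C. The helper names (mutf8_escape_nul, mutf8_unescape_nul, utf8_decode2) are hypothetical and are not GLib API. The sketch assumes the input is otherwise valid standard UTF-8, in which a 0x00 byte can only be U+0000 and the byte 0xC0 never occurs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of GLib: copy `len` bytes of standard UTF-8
 * from `in` to `out`, writing each NUL byte as the pair 0xC0 0x80.
 * `out` must have room for 2 * len bytes. Returns the bytes written. */
size_t mutf8_escape_nul(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        if (in[i] == 0x00) {
            out[n++] = 0xC0;
            out[n++] = 0x80;
        } else {
            out[n++] = in[i];
        }
    }
    return n;
}

/* Hypothetical inverse: turn each 0xC0 0x80 pair back into a NUL byte.
 * `out` must have room for `len` bytes. Returns the bytes written. */
size_t mutf8_unescape_nul(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        if (in[i] == 0xC0 && i + 1 < len && in[i + 1] == 0x80) {
            out[n++] = 0x00;
            i++; /* skip the continuation byte */
        } else {
            out[n++] = in[i];
        }
    }
    return n;
}

/* The ordinary two-byte decoding step: 110xxxxx 10yyyyyy -> xxxxxyyyyyy.
 * Applied to 0xC0 0x80 it yields code point 0. */
unsigned utf8_decode2(unsigned char b0, unsigned char b1)
{
    return ((unsigned)(b0 & 0x1F) << 6) | (unsigned)(b1 & 0x3F);
}
```

The escaped form can be passed around as an ordinary NUL-terminated string, at the cost of converting at every boundary where standard UTF-8 is expected (for example when saving the file).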
I filed a bug on g_utf8_validate() here:
g_utf8_validate() could simply be fixed to accept NUL characters, but functions
that return a gchar* with no length output parameter, like
gtk_text_buffer_get_text(), would require replacements.
Another possibility mentioned was making more use of GString.
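The problem with a bare gchar* return value can be illustrated without GLib at all. The following is a rough sketch in plain C; the counted_text type and both function names are made up for illustration. It shows that a NUL-terminated view loses everything after the first embedded NUL, while a pointer paired with an explicit length (which is what GString stores in its str and len fields) keeps the whole buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-carrying view of a text buffer, similar in spirit
 * to a pointer plus length pair. */
typedef struct {
    const char *data;
    size_t      len;
} counted_text;

/* What a caller sees when it only gets a char* and must rely on the
 * terminating NUL: the length up to the first NUL byte. */
size_t visible_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* With an explicit length, embedded NULs are just data and can be counted. */
size_t count_nuls(counted_text t)
{
    size_t nuls = 0;

    assert(t.data != NULL);
    for (size_t i = 0; i < t.len; i++) {
        if (t.data[i] == '\0')
            nuls++;
    }
    return nuls;
}
```

For the five-byte buffer "ab\0cd", visible_length() reports 2, whereas the counted view still covers all 5 bytes and reports one embedded NUL.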
Is there any reason not to support NUL/U+0000 in strings?