g_utf8_validate() and NUL characters

I ended up here after chasing down the "invalid character coding" behavior of
gedit. gedit tries to convert a file to UTF-8 using g_convert(), which always succeeds
when converting from an 8-bit encoding like ISO-8859-1. The converted string
contents could contain a NUL, since that's the canonical representation of
U+0000 NULL, a valid character. However, gedit must call g_utf8_validate() on
the contents to make sure that GTK+ widgets will accept the string, and
g_utf8_validate() does not consider a NUL character valid. As a result of all
this, gedit's inability to edit "binary files" is simply an inability to edit a
file with a NUL byte in it.

The bug in gedit is here, with a rather poor patch that lets the file be opened
but corrupts it if saved: http://bugzilla.gnome.org/show_bug.cgi?id=156199

I discussed this on #gtk+ with mathrick and pbor and it seems that the
assumption that UTF-8 strings are NUL-terminated and contain no NULs runs pretty
deep. A possible solution is to use "modified UTF-8" (
http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 ), which represents U+0000 as
the two-byte sequence 0xC0 0x80. That sequence is an overlong encoding and so is
illegal in standard UTF-8, but the normal decoding algorithm still decodes it to
U+0000.
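
To make the modified UTF-8 idea concrete, here is a minimal sketch in plain C. The names mutf8_encode() and mutf8_decode() are made up for illustration and are not GLib functions. It only covers the U+0000 substitution (Java's modified UTF-8 also treats supplementary characters differently, which is ignored here) and assumes the input is otherwise valid standard UTF-8, where a 0x00 byte can only mean U+0000 and 0xC0 never occurs.

```c
#include <stddef.h>

/* Hypothetical helper, not part of GLib: copy `len` bytes of standard UTF-8
 * from `src` to `dst`, writing every 0x00 byte as the pair 0xC0 0x80.
 * `dst` needs room for 2 * len + 1 bytes in the worst case.
 * Returns the number of bytes written, not counting the terminating NUL. */
size_t mutf8_encode(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == 0x00) {
            dst[out++] = 0xC0;
            dst[out++] = 0x80;
        } else {
            dst[out++] = src[i];
        }
    }
    dst[out] = 0x00; /* the result is an ordinary NUL-terminated string */
    return out;
}

/* Hypothetical inverse: read the NUL-terminated string `src` and turn every
 * 0xC0 0x80 pair back into a single 0x00 byte. `dst` needs room for as many
 * bytes as `src` holds. Returns the decoded length; the output is NOT
 * NUL-terminated, because it may contain NUL bytes itself. */
size_t mutf8_decode(const unsigned char *src, unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; src[i] != 0x00; i++) {
        if (src[i] == 0xC0 && src[i + 1] == 0x80) {
            dst[out++] = 0x00;
            i++; /* skip the second byte of the pair */
        } else {
            dst[out++] = src[i];
        }
    }
    return out;
}
```

With this, the three bytes 'a', 0x00, 'b' become the four bytes 'a', 0xC0, 0x80, 'b' plus a terminator, so code that assumes NUL-terminated strings sees the whole thing. The catch is that a strict validator would have to be taught to accept 0xC0 0x80.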

I filed a bug on g_utf8_validate() here:

g_utf8_validate() could simply be fixed to accept NUL characters, but functions
that return a gchar* with no length output parameter, like 
gtk_text_buffer_get_text(), would require replacements.
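
To illustrate why such functions would need replacements, here is a toy sketch in plain C. ToyBuffer, toy_get_text() and toy_get_text_len() are hypothetical names, not GTK+ API. The first function mirrors the pointer-only style of return value, the second adds a length output parameter.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a text buffer; not a GTK+ type. */
typedef struct {
    const char *data; /* may contain embedded NUL bytes */
    size_t len;       /* number of bytes in data */
} ToyBuffer;

/* Pointer-only style: the caller gets a copy but no length, so it can only
 * find the end with strlen(), which stops at the first NUL. */
char *toy_get_text(const ToyBuffer *buf)
{
    char *copy = malloc(buf->len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, buf->data, buf->len);
    copy[buf->len] = '\0';
    return copy;
}

/* Replacement style: same copy, but the real length is reported through
 * len_out, so embedded NULs survive the round trip. */
char *toy_get_text_len(const ToyBuffer *buf, size_t *len_out)
{
    char *copy = toy_get_text(buf);
    if (copy != NULL && len_out != NULL)
        *len_out = buf->len;
    return copy;
}
```

For a buffer holding 'a', NUL, 'b', a caller of toy_get_text() that runs strlen() on the result sees one byte, while toy_get_text_len() reports all three.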

Another possibility mentioned was making more use of GString.

Is there any reason not to support NUL/U+0000 in strings?
