Re: g_utf8_validate() and NUL characters
- From: Behdad Esfahbod <behdad behdad org>
- To: coda <coda trigger gmail com>
- Cc: gtk-devel-list gnome org
- Subject: Re: g_utf8_validate() and NUL characters
- Date: Tue, 07 Oct 2008 16:55:29 -0400
> I ended up here after pursuing the "invalid character coding" behavior of gedit.
> gedit tries to convert a file to UTF-8 using g_convert, which always succeeds
> when converting from an 8-bit encoding like ISO-8859-1. The converted string
> contents could contain a NUL, since that's the canonical representation of
> U+0000 NULL, a valid character. However, gedit must call g_utf8_validate() on
> the contents to make sure that GTK+ widgets will accept the string, and
> g_utf8_validate() does not consider a NUL character valid. As a result of all
> this, gedit's inability to edit "binary files" is simply an inability to edit a
> file with a NUL byte in it.
Have you tried to work around GTK+'s issue with a loop that skips over the
NULs, to see whether there are other issues?
> The bug in gedit is here, with a rather poor patch that lets the file be opened
> but corrupts it if saved: http://bugzilla.gnome.org/show_bug.cgi?id=156199
Note that while your approach of converting from ISO-8859-1 to UTF-8 "works",
entering UTF-8 text into that file and then trying to save does not. So it's
an either-text-or-binary approach to editing. A truly useful editing mode
would let you open a file with mixed UTF-8 text and binary data, edit the
UTF-8 text, and save.
> I discussed this on #gtk+ with mathrick and pbor and it seems that the
> assumption that UTF-8 strings are NUL-terminated and contain no NULs runs pretty
> deep. A possible solution is to use "modified UTF-8" (
> http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 ), which represents U+0000
> as the two-byte sequence 0xC0 0x80; that sequence is illegal in standard
> UTF-8, but decodes to U+0000 under the normal decoding algorithm.
I believe the NUL bytes are the smallest of your problems.
> I filed a bug on g_utf8_validate() here:
> g_utf8_validate() could simply be fixed to accept NUL characters,
Yes, that's what I prefer too. Many glib functions take a length argument,
but its interpretation varies significantly across functions. Most
interpret it as:
- If -1, str is nul-terminated. Otherwise length is the *maximum* length in
bytes of str.
The problematic part is the "max" there: it disallows NUL bytes in str even
if a length is provided. The reasoning for this I've heard from Owen is to
allow slicing a prefix of a string, say, "at most 20 bytes". However, that
approach is inherently incompatible with UTF-8 text. One can't simply take
the first 20 bytes of a string and hope that the result is valid UTF-8,
since the cut may land in the middle of a multi-byte character.
A saner interpretation would be:
- If -1, str is nul-terminated. Otherwise length is the length in bytes of str.
And then there's g_utf8_validate(), which interprets it as:
- If -1, str is nul-terminated. Otherwise, length is the length in bytes of
str, and str must not contain a NUL in the first length bytes.
Ugh. Why is that? Who knows? Matthias suggested it's because a string that
claims to be length bytes long but terminates prematurely is not valid.
However, that reasoning assumes the string is nul-terminated in the first
place.
So yeah, it's all a mess. I'd like to somehow clean it up, but that may have
to wait until glib 3.0... I don't think the implications of the changes
would be very catastrophic, but one can't know without extensively going
over all uses in all projects... Some Google Code voodoo may help us get a
rough feeling of the impact.
> but functions
> that return a gchar* with no length output parameter, like
> gtk_text_buffer_get_text(), would require replacements.
> Another possibility mentioned was making more use of GString.
Not a huge fan.
> Is there any reason not to support NUL/U+0000 in strings?
None that I know of, and I've been trying to fix this in Pango.