Re: g_utf8_validate() and NUL characters


On Mon, Oct 6, 2008 at 4:42 PM, coda <coda trigger gmail com> wrote:
> As a result of all
> this, gedit's inability to edit "binary files" is simply an inability to edit a
> file with a NUL byte in it.

That doesn't seem true; a binary file could be invalid UTF-8 (or
whatever encoding) in thousands of ways besides an embedded NUL, no?
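
To make that concrete, here is a rough sketch of a well-formedness check in
plain C. It is not GLib's implementation; looks_like_utf8_text() is a
hypothetical name, and I'm assuming we want to reject embedded NUL the way
g_utf8_validate() does when given an explicit length. The point is only how
many distinct failure branches there are besides the NUL one.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not GLib code: returns true only if buf[0..len)
 * is well-formed UTF-8 and contains no embedded NUL byte. */
bool looks_like_utf8_text(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;        /* continuation bytes expected after c */
        unsigned long cp;    /* code point being decoded */
        unsigned long min;   /* smallest code point legal at this length */

        if (c == 0x00) {
            return false;                     /* embedded NUL */
        } else if (c < 0x80) {
            i++;                              /* plain ASCII */
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;                     /* stray continuation or 0xF8..0xFF */
        }

        if (len - i <= extra)
            return false;                     /* sequence cut off by end of data */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;                 /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min)
            return false;                     /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;                     /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return false;                     /* beyond the Unicode range */

        i += extra + 1;
    }
    return true;
}
```

A lone 0xFF byte, a stray 0x80, a truncated "\xE2\x82", or the overlong pair
"\xC0\x80" all fail here without a NUL anywhere in sight.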

I mean, editing binary and editing text just isn't the same thing.  A
text editor must understand the encoding so it can display and edit
the text as text.

The only way to edit binary in a text editor (such as GtkTextView or
whatever) is to somehow convert the binary to text. So for example
g_convert_with_fallback() might be appropriate. But that is pretty
much the same solution as the patch here:

> The bug in gedit is here, with a rather poor patch that lets the file be opened
> but corrupts it if saved:

The patch just falls back to "?" on NUL, but it could just as easily do
that for any invalid text it finds, as g_convert_with_fallback() does,
not just for NUL.

This approach inherently will not round trip: once a byte has been
replaced with "?", its original value is gone.
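
A small sketch of why, in plain C. replace_nul_with_fallback() is a
hypothetical stand-in for the "?" substitution, not the actual gedit patch
and not g_convert_with_fallback() itself: two different inputs end up as the
same text, so nothing at save time can tell them apart.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a "?" fallback: overwrite every NUL byte
 * in buf[0..len) with '?'. Returns how many bytes were replaced. */
size_t replace_nul_with_fallback(char *buf, size_t len)
{
    size_t replaced = 0;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\0') {
            buf[i] = '?';
            replaced++;
        }
    }
    return replaced;
}

/* Returns 1 if two distinct inputs become identical after the fallback. */
int fallback_loses_information(void)
{
    char with_nul[3]      = { 'a', '\0', 'b' };
    char with_question[3] = { 'a', '?',  'b' };

    replace_nul_with_fallback(with_nul, sizeof with_nul);
    replace_nul_with_fallback(with_question, sizeof with_question);

    /* Both buffers now hold "a?b"; the original NUL cannot be recovered. */
    return memcmp(with_nul, with_question, sizeof with_nul) == 0;
}
```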

Perhaps a better solution is to come up with some two-way
binary-to-text conversion, such as converting the file to a bunch of
hex digits. Allow editing the hex digits, then reconvert to binary on
save. Or something. I don't know. I think for binary files a text
editor just doesn't really work.
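
For what it's worth, the hex idea is easy to sketch in plain C. The names
hex_encode() and hex_decode() are made up for illustration; this is only
meant to show that the mapping is reversible for every byte value, NUL
included, not to suggest how an editor should actually wire it in.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write 2*len lowercase hex digits plus a
 * terminating NUL into out, which must hold at least 2*len + 1 bytes. */
void hex_encode(const unsigned char *in, size_t len, char *out)
{
    static const char digits[] = "0123456789abcdef";

    for (size_t i = 0; i < len; i++) {
        out[2 * i]     = digits[in[i] >> 4];
        out[2 * i + 1] = digits[in[i] & 0x0F];
    }
    out[2 * len] = '\0';
}

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decode hex_len hex digits into out (at least
 * hex_len / 2 bytes). Returns the number of bytes written, or -1 if the
 * edited text has an odd length or contains a non-hex character. */
long hex_decode(const char *hex, size_t hex_len, unsigned char *out)
{
    if (hex_len % 2 != 0)
        return -1;

    for (size_t i = 0; i < hex_len / 2; i++) {
        int hi = hex_value(hex[2 * i]);
        int lo = hex_value(hex[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        out[i] = (unsigned char)((hi << 4) | lo);
    }
    return (long)(hex_len / 2);
}

/* Returns 1 if bytes that are not valid text survive encode + decode. */
int hex_round_trips(void)
{
    const unsigned char original[4] = { 0x00, 0xFF, 'A', 0x80 };
    char text[2 * sizeof original + 1];
    unsigned char decoded[sizeof original];

    hex_encode(original, sizeof original, text);
    if (hex_decode(text, strlen(text), decoded) != (long)sizeof original)
        return 0;
    return memcmp(original, decoded, sizeof original) == 0;
}
```

The catch is on the decode side: once the user can edit the digits freely,
save has to cope with odd lengths and stray characters, which is why
hex_decode() can fail.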

Or maybe TextView could be smart enough to explicitly separate "binary
garbage" segments, and display/edit them differently from text,
analogous to how it handles images. That's probably a pretty involved
patch, though. And it would encounter fundamental text view
limitations, e.g. that it does not scale with overly long paragraphs
(too much stuff between newlines).

> g_utf8_validate() could simply be fixed to accept NUL characters, but functions
> that return a gchar* with no length output parameter, like
> gtk_text_buffer_get_text(), would require replacements.

I think you'd find that GtkTextView breaks in some fairly deep ways,
though maybe not.
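
The length problem itself is easy to show in plain C. This is only a sketch
with a made-up helper name, length_seen_by_c_string_api(); it models what any
consumer of a bare char* (no length parameter) can see of a buffer, and says
nothing about GtkTextView internals.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: given a buffer of real_len bytes, return how many
 * of them are visible to code that only receives a char* and stops at
 * the first NUL, as strlen()-style consumers do. */
size_t length_seen_by_c_string_api(const char *buf, size_t real_len)
{
    const char *nul = memchr(buf, '\0', real_len);

    return nul ? (size_t)(nul - buf) : real_len;
}
```

For the five bytes 'a' 'b' NUL 'c' 'd' this returns 2, so everything after
the NUL silently disappears for any caller that was not handed a length.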

> Is there any reason not to support NUL/U+0000 in strings?

The point of not allowing NUL in g_utf8_validate() I think is that NUL
is not valid text. It may be valid Unicode in some technical sense,
but it isn't text, in the same sense that malformed UTF-8 isn't text.


