Re: g_utf8_validate() and NUL characters

From: "Havoc Pennington" <hp pobox com>
To: coda <coda trigger gmail com>
Cc: gtk-devel-list gnome org
Subject: Re: g_utf8_validate() and NUL characters
Date: Tue, 7 Oct 2008 00:04:19 -0400

Hi,

On Mon, Oct 6, 2008 at 4:42 PM, coda <coda trigger gmail com> wrote:
> As a result of all
> this, gedit's inability to edit "binary files" is simply an inability to edit a
> file with a NUL byte in it.

That doesn't seem true; a binary file could be invalid UTF-8 (or
whatever encoding) in thousands of ways besides an embedded NUL, no?

I mean, editing binary and editing text just isn't the same thing.  A
text editor must understand the encoding so it can display and edit
the text as text.

The only way to edit binary in a text editor (such as GtkTextView or
whatever) is to somehow convert the binary to text. So for example
g_convert_with_fallback() might be appropriate. But that is pretty
much the same solution as the patch here:

> The bug in gedit is here, with a rather poor patch that lets the file be opened
> but corrupts it if saved: http://bugzilla.gnome.org/show_bug.cgi?id=156199

The patch just falls back to "?" on NUL, but it could as easily do
that on whatever invalid text it finds, as with
g_convert_with_fallback(), not just on NUL.

This approach inherently will not round trip.
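
To make the round-trip problem concrete, here is a minimal sketch in plain C. The helper names are hypothetical and are not taken from GLib, gedit, or the patch in the bug report; it only assumes the fallback replaces each NUL byte with "?". Two different files end up as the same text in the editor, so saving cannot tell them apart.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of GLib or gedit: copy `len` bytes from
 * `in` to `out`, substituting '?' for every NUL byte. */
static void replace_nul_with_fallback(const unsigned char *in, size_t len,
                                      unsigned char *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = (in[i] == 0x00) ? '?' : in[i];
}

/* Returns 1 if two different inputs produce the same displayed text,
 * i.e. the original bytes cannot be recovered from what the editor holds. */
static int fallback_collides(void)
{
    const unsigned char with_nul[3]      = { 'a', 0x00, 'b' };
    const unsigned char with_question[3] = { 'a', '?',  'b' };
    unsigned char shown_a[3], shown_b[3];

    replace_nul_with_fallback(with_nul, 3, shown_a);
    replace_nul_with_fallback(with_question, 3, shown_b);

    return memcmp(shown_a, shown_b, 3) == 0;
}
```

The same collision happens for any other fallback character, which is why the substitution is lossy no matter which bytes trigger it.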

Perhaps a better solution is to come up with some two-way
binary-to-text conversion, such as converting the file to a bunch of
hex digits. Allow editing the hex digits, then reconvert to binary on
save. Or something. I don't know. I think for binary files a text
editor just doesn't really work.
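
A rough sketch of what such a two-way conversion could look like, in plain C with hypothetical helper names (nothing here is an existing GLib or GTK+ API). Every byte becomes exactly two hex digits, so decoding the unedited text gives back the original bytes, NUL included.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only. */

/* Write 2*len hex digits plus a terminating NUL into `hex`.
 * `hex` must have room for 2*len + 1 bytes. */
static void bytes_to_hex(const unsigned char *in, size_t len, char *hex)
{
    static const char digits[] = "0123456789abcdef";
    for (size_t i = 0; i < len; i++) {
        hex[2 * i]     = digits[in[i] >> 4];
        hex[2 * i + 1] = digits[in[i] & 0x0f];
    }
    hex[2 * len] = '\0';
}

/* Value of one hex digit, or -1 if `c` is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode hex text back into bytes. Returns the number of bytes written,
 * or -1 if the text has odd length or contains a non-hex character.
 * `out` must have room for strlen(hex) / 2 bytes. */
static long hex_to_bytes(const char *hex, unsigned char *out)
{
    size_t n = 0;
    while (hex[0] != '\0') {
        int hi = hex_value(hex[0]);
        if (hi < 0 || hex[1] == '\0') return -1;
        int lo = hex_value(hex[1]);
        if (lo < 0) return -1;
        out[n++] = (unsigned char)((hi << 4) | lo);
        hex += 2;
    }
    return (long)n;
}
```

The decoder has to reject edits that leave an odd number of digits or a stray character, which is one reason this is closer to a hex editor than to ordinary text editing.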

Or maybe TextView could be smart enough to explicitly separate "binary
garbage" segments, and display/edit them differently from text,
analogous to how it handles images. That's probably a pretty involved
patch, though. And it would run into fundamental text view
limitations, e.g. that it does not scale to overly long paragraphs
(too much stuff between newlines).

> g_utf8_validate() could simply be fixed to accept NUL characters, but functions
> that return a gchar* with no length output parameter, like
> gtk_text_buffer_get_text(), would require replacements.

I think you'd find that GtkTextView breaks in some fairly deep ways,
though maybe not.
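
For illustration, the basic problem with any API that hands back a bare char pointer and no length can be sketched in plain C. The helper name is hypothetical; the only assumption is that the caller measures the result with strlen(), as C string code normally does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: given a buffer that really holds `actual_len`
 * bytes of content (followed by a terminating NUL), return how many of
 * those bytes a caller relying on strlen() would never see. */
static size_t bytes_hidden_after_nul(const char *buf, size_t actual_len)
{
    size_t visible = strlen(buf);   /* stops at the first NUL byte */
    return visible < actual_len ? actual_len - visible : 0;
}
```

With the five content bytes 'a', 'b', NUL, 'c', 'd', strlen() reports 2, so three bytes are silently dropped by every caller that treats the pointer as an ordinary C string.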

> Is there any reason not to support NUL/U+0000 in strings?

The point of not allowing NUL in g_utf8_validate(), I think, is that NUL
is not valid text. It may be valid Unicode in some technical sense,
but it isn't text, in the same sense that malformed UTF-8 isn't text.

Havoc

Follow-Ups:
- Re: g_utf8_validate() and NUL characters
  - From: Behdad Esfahbod

References:
- g_utf8_validate() and NUL characters
  - From: coda
