Re: g_utf8_validate() and NUL characters
- From: "Havoc Pennington" <hp pobox com>
- To: "Behdad Esfahbod" <behdad behdad org>
- Cc: gtk-devel-list gnome org
- Subject: Re: g_utf8_validate() and NUL characters
- Date: Wed, 8 Oct 2008 23:47:23 -0400
On Wed, Oct 8, 2008 at 11:00 PM, Behdad Esfahbod <behdad behdad org> wrote:
> Lemme pull a real-world example: Last year I had to fix a bug in Firefox where
> a page with a nul byte crashed the browser.
What I don't see is how a nul byte is in any way different from an
invalid sequence, other than being
strictly-speaking-allowed-by-the-unicode-spec. If we care about the
strictly speaking there, then we have to say gtk doesn't support utf8
because we have nul-terminated string APIs. I don't think in practice
the character 0 is useful, and I think doing the APIs with
nul-termination was a correct decision.
The nul byte has the downside that, as we have been pointing out about
the gtk stack, C programmers do *not* expect strings to have nul bytes.
This is why nul is different from other nonprintable characters: it
breaks a bunch of C code, in practice. Nobody does anything special
about the other nonprintables, but people treat nul as a special case
all over the place.
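
To make the "breaks a bunch of C code" point concrete, here is a minimal sketch in plain standard C (no glib). The helper names visible_length() and bytes_lost() are hypothetical, made up for illustration; they only show how much of a length-delimited buffer survives once it is handed to code that stops at the first nul.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the length a nul-terminated API would "see"
 * for a buffer whose real length is len bytes. */
static size_t visible_length(const char *buf, size_t len)
{
    assert(buf != NULL);
    /* memchr() respects the explicit length, unlike strlen(). */
    const void *nul = memchr(buf, '\0', len);
    return nul ? (size_t)((const char *)nul - buf) : len;
}

/* Hypothetical helper: bytes (including the nul itself) that are
 * silently dropped when the buffer is treated as a C string. */
static size_t bytes_lost(const char *buf, size_t len)
{
    return len - visible_length(buf, len);
}
```

For the 7-byte buffer "abc\0def", visible_length() gives 3 and bytes_lost() gives 4: any code path using strlen(), strcpy(), or strdup() never sees the "def" part, and nothing reports the truncation.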
> Why? Because Firefox did Unicode
> validation on input, but then tried to convert UTF-16 to UTF-8 using glib and
> pass it on to GTK+/Pango function. Somewhere along the lines the nul byte was
> playing bad... That's the sort of problems being stricter than the standard
But let's turn this around. If Firefox had used g_utf8_validate()
semantics (or g_convert_with_fallback() semantics) to validate input,
nothing would have crashed. If anything this seems like an example of
failing to disallow nul causing crashes.
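
A rough sketch of what "disallow nul at the boundary" could look like, again in plain standard C. This is not g_utf8_validate() itself and does not check UTF-8 well-formedness; it only mimics the one property under discussion (rejecting embedded nul bytes) before data crosses into nul-terminated APIs. The names safe_for_c_string_api() and to_c_string() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical gate: true only if none of the len bytes is nul, so a
 * nul-terminated API would see the whole buffer. */
static bool safe_for_c_string_api(const char *buf, size_t len)
{
    assert(buf != NULL);
    return memchr(buf, '\0', len) == NULL;
}

/* Hypothetical conversion: returns a freshly allocated nul-terminated
 * copy, or NULL if the input has an embedded nul or malloc() fails.
 * The caller frees the result with free(). */
static char *to_c_string(const char *buf, size_t len)
{
    if (!safe_for_c_string_api(buf, len))
        return NULL;            /* reject instead of truncating */
    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, buf, len);
    out[len] = '\0';
    return out;
}
```

The design choice is to fail loudly at the point of conversion, so the caller has to decide what to do with the odd input, rather than letting a silently truncated string travel further into the stack.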
I bet nul bytes in firefox still break in more obscure cases, too,
despite fixing this bug. Pretty sure Firefox converts its strings to
nul-terminated C strings from time to time as it uses third party
libraries and such.
> As a user all I care is that 1) my browser/editor doesn't crash, 2) it shows
> me something when I ask it to open a file.
I would say allowing one specific kind of invalid file (one with a nul
byte) does not make sense, unless you're going to open *any* file. And
g_utf8_validate() doesn't even make sense then. Then you need
g_convert_with_fallback(), or a hex editor, or something. nul byte is
*one of infinitely many ways* a file can be impossible to edit in a text
editor.
If you care about not crashing and showing the user something for any
file, then you need to talk about random binary garbage, not about nul
bytes. g_utf8_validate() becomes irrelevant. g_utf8_validate() is only
relevant when you're going to show *text*, not when you want to show
an *arbitrary byte stream*.
nul bytes may be valid unicode, but they are not valid text. Or at
least not *useful* text.
I also would say that allowing nul bytes to unexpectedly float through
apps is most likely going to create more crashes than it fixes. But, I
suppose reasonable people could disagree. I have certainly written tons
and tons of code that does not work on strings with nul bytes in them.
But my basic claim is that to get 1) my browser/editor doesn't crash,
2) it shows me something when I ask it to open a file, what you want
is to load arbitrary junk, not just text files with one specific
oddity (nul bytes).
>> As a side issue, I think in most cases programs likely break if they
>> load a non-nul-terminated string, so it's convenient if
>> g_utf8_validate() is catching that.
> I don't agree. I have made Pango cleanly handle nul bytes. That's not
> impossible, just bugs here and there.
I didn't say it was impossible, I said there would be bugs here and there ;-)
And in fact we have the proof, in gtk there are bugs here and there.
Otherwise we wouldn't even have this thread.
I'd say most existing app code, and newly-written app code, will have
bugs here and there until and unless the programmer explicitly
considers this issue and tests it. And few will.