Re: g_utf8_validate() and NUL characters


On Wed, Oct 8, 2008 at 11:00 PM, Behdad Esfahbod <behdad behdad org> wrote:
> Lemme pull a real-world example: Last year I had to fix a bug in Firefox where
> a page with a nul byte crashed the browser.

What I don't see is how a nul byte is in any way different from an
invalid sequence, other than being
strictly-speaking-allowed-by-the-unicode-spec. If we care about the
strictly speaking there, then we have to say gtk doesn't support utf8
because we have nul-terminated string APIs. I don't think in practice
the character 0 is useful, and I think doing the APIs with
nul-termination was a correct decision.

The nul byte has the downside that, as we have been pointing out about
the gtk stack, C programmers do *not* expect strings to have nul bytes
in them.

This is why nul is different from other nonprintable characters: it
breaks a bunch of C code in practice. Nobody does anything special
about the other nonprintables, but people treat nul as a special case
all over the place.
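
To make the point above concrete, here is a minimal sketch in plain C of what nul-terminated code "sees" when a buffer of known length has a nul byte inside it. The helper name visible_length() is made up for illustration and is not part of GLib or GTK.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a GLib function): given a buffer whose real
 * length is known, report how many bytes ordinary nul-terminated C code
 * (strlen, strcpy, printf("%s"), ...) would treat as "the string".
 * Everything after the first nul byte is silently ignored by such code.
 */
static size_t visible_length(const char *buf, size_t len)
{
    const char *nul = memchr(buf, '\0', len);

    return nul ? (size_t)(nul - buf) : len;
}
```

For the 7-byte buffer "abc\0def", visible_length() reports 3, the same answer strlen() gives. The remaining bytes are lost without any error, which is the kind of quiet mismatch that tends to turn into bugs later.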

> Why?  Because Firefox did Unicode
> validation on input, but then tried to convert UTF-16 to UTF-8 using glib and
> pass it on to GTK+/Pango function.  Somewhere along the lines the nul byte was
> playing bad...  That's the sort of problems being stricter than the standard
> causes.

But let's turn this around. If Firefox had used g_utf8_validate()
semantics (or g_convert_with_fallback() semantics) to validate input,
nothing would have crashed. If anything this seems like an example of
failing to disallow nul causing crashes.
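
As a rough illustration of the semantics argued for here (reject input with a nul byte before it reaches nul-terminated APIs), the following is a simplified stand-in written in plain C. It is not GLib's implementation, and the function name text_is_safe_for_c_apis() is an assumption made for this sketch. It accepts a buffer only if it is well-formed UTF-8 and contains no nul byte within the given length.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in, not GLib code: returns true only if
 * buf[0..len) is well-formed UTF-8 and contains no nul byte.
 */
static bool text_is_safe_for_c_apis(const char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = (unsigned char)buf[i];
        size_t need;            /* continuation bytes expected */
        unsigned long cp, min;  /* decoded code point, smallest legal value */

        if (b == 0x00)
            return false;       /* embedded nul: reject */
        if (b < 0x80) {         /* plain ASCII */
            i++;
            continue;
        }
        if ((b & 0xE0) == 0xC0)      { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else
            return false;       /* stray continuation or invalid lead byte */

        if (len - i - 1 < need)
            return false;       /* sequence truncated by end of buffer */

        for (size_t k = 1; k <= need; k++) {
            unsigned char c = (unsigned char)buf[i + k];
            if ((c & 0xC0) != 0x80)
                return false;   /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min)
            return false;       /* overlong encoding, e.g. C0 80 for nul */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;       /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return false;       /* beyond the Unicode range */

        i += need + 1;
    }
    return true;
}
```

With a check like this at the input boundary, a buffer such as "a\0b" (length 3) is rejected up front instead of being truncated somewhere deeper in the stack.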

I bet nul bytes in Firefox still break in more obscure cases, too,
despite fixing this bug. Pretty sure Firefox converts its strings to
nul-terminated C strings from time to time as it uses third party
libraries and such.

> As a user all I care is that 1) my browser/editor doesn't crash, 2) it shows
> me something when I ask it to open a file.

I would say allowing one specific kind of invalid file (one with a nul
byte) does not make sense, unless you're going to open *any* file. And
g_utf8_validate() doesn't even make sense then. Then you need
g_convert_with_fallback(), or a hex editor, or something. A nul byte is
*one of infinite ways* a file can be impossible to edit in a text
editor.

If you care about not crashing and showing the user something for any
file, then you need to talk about random binary garbage, not about nul
bytes. g_utf8_validate() becomes irrelevant. g_utf8_validate() is only
relevant when you're going to show *text*, not when you want to show
an *arbitrary byte stream*.

nul bytes may be valid unicode, but they are not valid text. Or at
least not *useful* text.

I also would say that allowing nul bytes to unexpectedly float through
apps is most likely going to create more crashes than it fixes. But, I
suppose reasonable people could disagree. I have certainly written tons
and tons of code that does not work on strings with nul bytes in them,

But my basic claim is that to get 1) my browser/editor doesn't crash,
2) it shows me something when I ask it to open a file, what you want
is to load arbitrary junk, not just text files with one specific
oddity (nul bytes).

>> As a side issue, I think in most cases programs likely break if they
>> load a non-nul-terminated string, so it's convenient if
>> g_utf8_validate() is catching that.
> I don't agree.  I have made Pango cleanly handle nul bytes.  That's not
> impossible, just bugs here and there.

I didn't say it was impossible, I said there would be bugs here and there ;-)

And in fact we have the proof, in gtk there are bugs here and there.
Otherwise we wouldn't even have this thread.

I'd say most existing app code, and newly-written app code, will have
bugs here and there until and unless the programmer explicitly
considers this issue and tests it. And few will.

