Re: g_utf8_validate() and NUL characters



Hi,

On Wed, Oct 8, 2008 at 11:00 PM, Behdad Esfahbod <behdad behdad org> wrote:
> Lemme pull a real-world example: Last year I had to fix a bug in Firefox where
> a page with a nul byte crashed the browser.

What I don't see is how a nul byte is in any way different from an
invalid sequence, other than being
strictly-speaking-allowed-by-the-unicode-spec. If we care about the
strictly speaking there, then we have to say gtk doesn't support utf8
because we have nul-terminated string APIs. I don't think in practice
the character 0 is useful, and I think doing the APIs with
nul-termination was a correct decision.

The nul byte has the downside that, as we have been pointing out about
the gtk stack, C programmers do *not* expect strings to have nul bytes
in them.

This is why nul is different from other nonprintable characters: it
breaks a bunch of C code, in practice. Nobody does anything special
about the other nonprintables, but people treat nul as a special case
all over the place.
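
To make the "breaks a bunch of C code" point concrete, here is a rough sketch in plain C (no glib). The helper names visible_len() and bytes_lost() are made up for illustration; the only real API used is strlen().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of bytes a nul-terminated C API would see in buf. */
static size_t visible_len(const char *buf)
{
    assert(buf != NULL);
    return strlen(buf);
}

/* Number of bytes silently dropped when a buffer holding real_len bytes
 * of data is handed to code that treats it as a C string.
 * Assumption: buf[real_len] is a nul terminator, so strlen() stays in bounds. */
static size_t bytes_lost(const char *buf, size_t real_len)
{
    size_t seen = visible_len(buf);
    return seen < real_len ? real_len - seen : 0;
}
```

For a 7-byte buffer holding "abc", a nul, then "def", visible_len() reports 3 and bytes_lost() reports 4: everything after the embedded nul vanishes as far as any strlen()-style code is concerned.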

> Why?  Because Firefox did Unicode
> validation on input, but then tried to convert UTF-16 to UTF-8 using glib and
> pass it on to GTK+/Pango function.  Somewhere along the lines the nul byte was
> playing bad...  That's the sort of problems being stricter than the standard
> causes.

But let's turn this around. If Firefox had used g_utf8_validate()
semantics (or g_convert_with_fallback() semantics) to validate input,
nothing would have crashed. If anything this seems like an example of
failing to disallow nul causing crashes.
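
As a rough illustration of what "disallow nul on input" means, here is a stand-in check in plain C. It is not the glib implementation and does no UTF-8 validation at all; input_is_nul_free() and first_nul_offset() are hypothetical names, and only memchr() is a real library call.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if none of the len bytes at buf is a nul byte. */
static bool input_is_nul_free(const char *buf, size_t len)
{
    return memchr(buf, '\0', len) == NULL;
}

/* Hypothetical helper: offset of the first nul byte in buf, or len if
 * there is none, so a caller can report where the input went bad. */
static size_t first_nul_offset(const char *buf, size_t len)
{
    const char *p = memchr(buf, '\0', len);
    return p != NULL ? (size_t)(p - buf) : len;
}
```

An app that rejects (or falls back on) input failing a check like this at the boundary never hands an embedded nul to the nul-terminated APIs further down. As I understand the behavior this thread is about, g_utf8_validate() called with an explicit length already gives you this, on top of the actual UTF-8 checks.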

I bet nul bytes in firefox still break in more obscure cases, too,
despite fixing this bug. Pretty sure Firefox converts its strings to
nul-terminated C strings from time to time as it uses third party
libraries and such.

> As a user all I care is that 1) my browser/editor doesn't crash, 2) it shows
> me something when I ask it to open a file.

I would say allowing one specific kind of invalid file (one with a nul
byte) does not make sense, unless you're going to open *any* file. And
g_utf8_validate() doesn't even make sense then. Then you need
g_convert_with_fallback(), or a hex editor, or something. A nul byte is
*one of infinitely many ways* a file can be impossible to edit in a text
editor.

If you care about not crashing and showing the user something for any
file, then you need to talk about random binary garbage, not about nul
bytes. g_utf8_validate() becomes irrelevant. g_utf8_validate() is only
relevant when you're going to show *text*, not when you want to show
an *arbitrary byte stream*.

nul bytes may be valid unicode, but they are not valid text. Or at
least not *useful* text.

I also would say that allowing nul bytes to unexpectedly float through
apps is most likely going to create more crashes than it fixes. But, I
suppose reasonable people could disagree. I have certainly written tons
and tons of code that does not work on strings with nul bytes in them,
though.

But my basic claim is that to get 1) my browser/editor doesn't crash,
2) it shows me something when I ask it to open a file, what you want
is to load arbitrary junk, not just text files with one specific
oddity (nul bytes).

>> As a side issue, I think in most cases programs likely break if they
>> load a non-nul-terminated string, so it's convenient if
>> g_utf8_validate() is catching that.
>
> I don't agree.  I have made Pango cleanly handle nul bytes.  That's not
> impossible, just bugs here and there.

I didn't say it was impossible, I said there would be bugs here and there ;-)

And in fact we have the proof: in gtk there are bugs here and there.
Otherwise we wouldn't even have this thread.

I'd say most existing app code, and newly-written app code, will have
bugs here and there until and unless the programmer explicitly
considers this issue and tests it. And few will.

Havoc

