Re: g_utf8_validate() and NUL characters

From: Behdad Esfahbod <behdad behdad org>
To: Havoc Pennington <hp pobox com>
Cc: gtk-devel-list gnome org
Subject: Re: g_utf8_validate() and NUL characters
Date: Thu, 09 Oct 2008 20:46:12 -0400

Havoc Pennington wrote:
> Hi,
> 
> On Wed, Oct 8, 2008 at 11:00 PM, Behdad Esfahbod <behdad behdad org> wrote:
>> Lemme pull a real-world example: Last year I had to fix a bug in Firefox where
>> a page with a nul byte crashed the browser.
> 
> What I don't see is how a nul byte is in any way different from an
> invalid sequence,

nul is invalid *just* because you declared it so.

> other than being
> strictly-speaking-allowed-by-the-unicode-spec. If we care about the
> strictly speaking there, then we have to say gtk doesn't support utf8
> because we have nul-terminated string APIs. I don't think in practice
> the character 0 is useful, and I think doing the APIs with
> nul-termination was a correct decision.

We already have some API that does not assume nul-termination when given a
positive length.
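
To make the length-versus-terminator point concrete, here is a minimal sketch in plain C. Both helper names are made up for illustration and are not GLib or GTK+ functions; the only assumption is that the caller knows the real byte length of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: how many bytes a nul-terminated API would see.
 * Everything after the first nul byte is silently dropped. */
static size_t visible_to_c_string_api(const char *buf)
{
    assert(buf != NULL);
    return strlen(buf);
}

/* Hypothetical helper: count nul bytes inside an explicit-length buffer.
 * A length-taking API can see (and report) these; strlen() cannot. */
static size_t count_embedded_nuls(const char *buf, size_t len)
{
    size_t n = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\0')
            n++;
    }
    return n;
}
```

Given the five bytes 'a' 'b' NUL 'c' 'd', the first helper reports 2 bytes while the second, called with len 5, finds one embedded nul. Any code path that switches from the second view to the first loses data without noticing.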

>> Why?  Because Firefox did Unicode
>> validation on input, but then tried to convert UTF-16 to UTF-8 using glib and
>> pass it on to GTK+/Pango functions.  Somewhere along the line the nul byte was
>> playing bad...  That's the sort of problem being stricter than the standard
>> causes.
> 
> But let's turn this around. If Firefox had used g_utf8_validate()
> semantics (or g_convert_with_fallback() semantics) to validate input,
> nothing would have crashed. If anything this seems like an example of
> failing to disallow nul causing crashes.

That's like saying: "we borked interoperability, so let's convert everyone to
glib."


> I bet nul bytes in firefox still break in more obscure cases, too,
> despite fixing this bug. Pretty sure Firefox converts its strings to
> nul-terminated C strings from time to time as it uses third party
> libraries and such.

Ain't gonna prove you wrong on this one :).

>> As a user all I care is that 1) my browser/editor doesn't crash, 2) it shows
>> me something when I ask it to open a file.
> 
> I would say allowing one specific kind of invalid file (one with a nul
> byte) does not make sense, unless you're going to open *any* file.

We disagree on whether nul is invalid to begin with.  That said,
pango_layout_set_text() indeed accepts any junk you throw at it, because I
found it useful to not be picky on input the programmer has not much control
over anyway.

  http://www.pango.org/ScriptGallery/

It's kinda the same philosophy that makes UI applications not handle memory
allocation failure.  What's a programmer to do when text is invalid?
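
For reference, the disagreement over whether nul is invalid comes down to one branch in a validator. The sketch below is a hand-written check in plain C, not GLib's implementation; utf8_is_valid and its allow_nul flag are hypothetical names. With allow_nul set to false it follows the stricter position argued in the quoted text, with true it follows what the Unicode standard permits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not a GLib function: returns true if buf[0..len)
 * is well-formed UTF-8.  U+0000 is accepted only when allow_nul is true. */
static bool utf8_is_valid(const unsigned char *buf, size_t len, bool allow_nul)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;          /* number of continuation bytes expected */
        unsigned long cp, min; /* decoded code point, smallest legal value */

        if (b < 0x80) {
            /* The whole policy question lives in this one test. */
            if (b == 0 && !allow_nul)
                return false;
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            extra = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            extra = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            extra = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return false;      /* stray continuation or invalid lead byte */
        }

        if (len - i <= extra)
            return false;      /* sequence truncated by end of buffer */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return false;  /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min)
            return false;      /* overlong encoding, e.g. 0xC0 0x80 */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;      /* out of range or UTF-16 surrogate */
        i += extra + 1;
    }
    return true;
}
```

Note that the function takes an explicit length, so it only makes sense for callers that already carry one around; with a nul-terminated input the allow_nul question never comes up.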


behdad

> And
> g_utf8_validate() doesn't even make sense then. Then you need
> g_convert_with_fallback(), or a hex editor, or something. nul byte is
> *one of infinite ways* a file can be impossible to edit in a text
> editor.
> 
> If you care about not crashing and showing the user something for any
> file, then you need to talk about random binary garbage, not about nul
> bytes. g_utf8_validate() becomes irrelevant. g_utf8_validate() is only
> relevant when you're going to show *text*, not when you want to show
> an *arbitrary byte stream*.
> 
> nul bytes may be valid unicode, but they are not valid text. Or at
> least not *useful* text.
> 
> I also would say that allowing nul bytes to unexpectedly float through
> apps is most likely going to create more crashes than it fixes. But, I
> suppose reasonable people could disagree. I have certainly written tons
> and tons of code that does not work on strings with nul bytes in them,
> though.
> 
> But my basic claim is that to get 1) my browser/editor doesn't crash,
> 2) it shows me something when I ask it to open a file, what you want
> is to load arbitrary junk, not just text files with one specific
> oddity (nul bytes).
> 
>>> As a side issue, I think in most cases programs likely break if they
>>> load a non-nul-terminated string, so it's convenient if
>>> g_utf8_validate() is catching that.
>> I don't agree.  I have made Pango cleanly handle nul bytes.  That's not
>> impossible, just bugs here and there.
> 
> I didn't say it was impossible, I said there would be bugs here and there ;-)
> 
> And in fact we have the proof, in gtk there are bugs here and there.
> Otherwise we wouldn't even have this thread.
> 
> I'd say most existing app code, and newly-written app code, will have
> bugs here and there until and unless the programmer explicitly
> considers this issue and tests it. And few will.
> 
> Havoc
>

Follow-Ups:
- Re: g_utf8_validate() and NUL characters
  - From: Dave Benson
- Re: g_utf8_validate() and NUL characters
  - From: Havoc Pennington

References:
- g_utf8_validate() and NUL characters
  - From: coda
- Re: g_utf8_validate() and NUL characters
  - From: Havoc Pennington
- Re: g_utf8_validate() and NUL characters
  - From: Behdad Esfahbod
- Re: g_utf8_validate() and NUL characters
  - From: Brian J. Tarricone
- Re: g_utf8_validate() and NUL characters
  - From: Havoc Pennington
- Re: g_utf8_validate() and NUL characters
  - From: Behdad Esfahbod
- Re: g_utf8_validate() and NUL characters
  - From: Havoc Pennington
