Re: g_utf8_validate() and NUL characters

Havoc Pennington wrote:
> Hi,
> On Wed, Oct 8, 2008 at 11:00 PM, Behdad Esfahbod <behdad behdad org> wrote:
>> Lemme pull a real-world example: Last year I had to fix a bug in Firefox where
>> a page with a nul byte crashed the browser.
> What I don't see is how a nul byte is in any way different from an
> invalid sequence,

nul is invalid *just* because you declared it so.

> other than being
> strictly-speaking-allowed-by-the-unicode-spec. If we care about the
> strictly speaking there, then we have to say gtk doesn't support utf8
> because we have nul-terminated string APIs. I don't think in practice
> the character 0 is useful, and I think doing the APIs with
> nul-termination was a correct decision.

We already have some API that does not assume nul-termination with a positive

>> Why?  Because Firefox did Unicode
>> validation on input, but then tried to convert UTF-16 to UTF-8 using glib and
>> pass it on to GTK+/Pango functions.  Somewhere along the line the nul byte was
>> playing bad...  That's the sort of problem being stricter than the standard
>> causes.
> But let's turn this around. If Firefox had used g_utf8_validate()
> semantics (or g_convert_with_fallback() semantics) to validate input,
> nothing would have crashed. If anything this seems like an example of
> failing to disallow nul causing crashes.

That's like saying: "we borked interoperability, so let's convert everyone to

> I bet nul bytes in firefox still break in more obscure cases, too,
> despite fixing this bug. Pretty sure Firefox converts its strings to
> nul-terminated C strings from time to time as it uses third party
> libraries and such.

Ain't gonna prove you wrong on this one :).

>> As a user all I care is that 1) my browser/editor doesn't crash, 2) it shows
>> me something when I ask it to open a file.
> I would say allowing one specific kind of invalid file (one with a nul
> byte) does not make sense, unless you're going to open *any* file.

We disagree on whether nul is invalid to begin with.  That said,
pango_layout_set_text() indeed accepts any junk you throw at it, because I
found it useful to not be picky on input the programmer has not much control
over anyway.

It's kinda the same philosophy that leads UI applications not to handle memory
allocation failure.  What's a programmer to do when text is invalid?
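
To make the disagreement concrete, here is a minimal sketch of a
length-aware UTF-8 check in plain C.  It is not GLib's implementation;
utf8_valid_n() and its allow_nul flag are hypothetical names made up for
this illustration.  The only point is that a 0x00 byte is a well-formed
encoding of U+0000, so rejecting it is a policy choice layered on top of
the usual well-formedness rules (no stray continuation bytes, no
truncated or overlong sequences, no surrogates, nothing above U+10FFFF).

```c
#include <assert.h>   /* only needed for the usage checks further down */
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, NOT a GLib function: checks that buf[0..len) is
 * well-formed UTF-8.  allow_nul selects the policy for U+0000. */
static bool utf8_valid_n(const unsigned char *buf, size_t len, bool allow_nul)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp, min;
        size_t need, k;

        if (c == 0x00) {
            /* Well-formed per Unicode; accepting it is a policy decision. */
            if (!allow_nul)
                return false;
            i++;
            continue;
        }
        if (c < 0x80) {            /* plain ASCII */
            i++;
            continue;
        }

        /* Lead byte: number of continuation bytes and smallest code point
         * that may legitimately use this sequence length. */
        if ((c & 0xE0) == 0xC0)      { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else
            return false;          /* stray continuation or invalid lead byte */

        if (len - i - 1 < need)
            return false;          /* sequence truncated by end of buffer */

        for (k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;      /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min)
            return false;          /* overlong encoding */
        if (cp > 0x10FFFF)
            return false;          /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;          /* UTF-16 surrogate */

        i += need + 1;
    }
    return true;
}
```

With this sketch, the three bytes 'a', 0x00, 'b' pass when allow_nul is
true and fail when it is false, e.g.

    assert(utf8_valid_n((const unsigned char *)"a\0b", 3, true));
    assert(!utf8_valid_n((const unsigned char *)"a\0b", 3, false));

while an overlong sequence such as 0xC0 0x80 fails under either policy.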


> And
> g_utf8_validate() doesn't even make sense then. Then you need
> g_convert_with_fallback(), or a hex editor, or something. nul byte is
> *one of infinite ways* a file can be impossible to edit in a text
> editor.
> If you care about not crashing and showing the user something for any
> file, then you need to talk about random binary garbage, not about nul
> bytes. g_utf8_validate() becomes irrelevant. g_utf8_validate() is only
> relevant when you're going to show *text*, not when you want to show
> an *arbitrary byte stream*.
> nul bytes may be valid unicode, but they are not valid text. Or at
> least not *useful* text.
> I also would say that allowing nul bytes to unexpectedly float through
> apps is most likely going to create more crashes than it fixes. But, I
> suppose reasonable people could disagree. I have certainly written tons
> and tons of code that does not work on strings with nul bytes in them,
> though.
> But my basic claim is that to get 1) my browser/editor doesn't crash,
> 2) it shows me something when I ask it to open a file, what you want
> is to load arbitrary junk, not just text files with one specific
> oddity (nul bytes).
>>> As a side issue, I think in most cases programs likely break if they
>>> load a non-nul-terminated string, so it's convenient if
>>> g_utf8_validate() is catching that.
>> I don't agree.  I have made Pango cleanly handle nul bytes.  That's not
>> impossible, just bugs here and there.
> I didn't say it was impossible, I said there would be bugs here and there ;-)
> And in fact we have the proof, in gtk there are bugs here and there.
> Otherwise we wouldn't even have this thread.
> I'd say most existing app code, and newly-written app code, will have
> bugs here and there until and unless the programmer explicitly
> considers this issue and tests it. And few will.
> Havoc
