Re: g_utf8_validate() and NUL characters



Havoc Pennington wrote:
> Hi,
> 
> On Tue, Oct 7, 2008 at 5:50 PM, Brian J. Tarricone <bjt23 cornell edu> wrote:
>> I think what he really meant (or if not, here's my take on it) was that NUL
>> bytes aren't *printable* text... like you'd say of low-value ASCII data.
>>  Sure, it's technically "text," but most of it isn't something you can
>> represent visually in a useful manner.
> 
> Exactly. I don't see why you would ever want a nul byte, in a
> situation where text is expected.

Because my code has no control over the input?  How is U+0000 different from
U+0001?  Or other similar control characters?  Now this is glib getting in the
way...

Lemme pull a real-world example: Last year I had to fix a bug in Firefox where
a page with a nul byte crashed the browser.  Why?  Because Firefox did Unicode
validation on input, but then tried to convert UTF-16 to UTF-8 using glib and
pass it on to a GTK+/Pango function.  Somewhere along the line the nul byte
broke things...  That's the sort of problem that being stricter than the
standard causes.  We recommend that applications validate on input and then
pass the text around freely.  These kinds of deviations do not help there.

> Another way to put it, I don't think nul bytes are a user-explainable
> concept. If anybody who isn't a programmer sees (how? what's the
> glyph?) a nul byte in a _text_ file, that's just bizarre. In fact, why
> would anybody want that? In a binary file sure. But binary files
> aren't utf8 _at all_.

As a user, all I care about is that 1) my browser/editor doesn't crash, and
2) it shows me something when I ask it to open a file.

> As a side issue, I think in most cases programs likely break if they
> load a non-nul-terminated string, so it's convenient if
> g_utf8_validate() is catching that.

I don't agree.  I have made Pango cleanly handle nul bytes.  That's not
impossible; it's just a matter of fixing bugs here and there.  Programs that
can't handle nul bytes typically can't because they use nul-terminated
strings, and in such programs a positive length other than strlen(str)
typically does not occur anyway.
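
To make the length point concrete, here is a rough sketch in plain C.  It is
not glib's code: utf8_valid_len() and has_embedded_nul() are hypothetical
helpers written for this mail, and they only illustrate a validator that takes
an explicit byte length and treats U+0000 like U+0001 or any other control
character.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a glib function): validate exactly `len` bytes
 * as UTF-8.  A 0x00 byte is accepted like any other ASCII control byte. */
static bool utf8_valid_len(const char *s, size_t len)
{
    assert(s != NULL || len == 0);
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;

    while (i < len) {
        unsigned char c = p[i];
        size_t n;              /* number of continuation bytes expected */
        unsigned long cp, min; /* decoded code point, smallest legal value */

        if (c < 0x80) { i++; continue; }  /* ASCII, including 0x00 */
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else return false;                /* stray continuation or bad lead */

        if (len - i <= n) return false;   /* sequence truncated by `len` */
        for (size_t k = 1; k <= n; k++) {
            if ((p[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        /* Reject overlong forms, surrogates, and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += n + 1;
    }
    return true;
}

/* Hypothetical helper: true if the first `len` bytes contain a 0x00 byte,
 * i.e. the explicit length disagrees with what strlen() would see. */
static bool has_embedded_nul(const char *s, size_t len)
{
    return len > 0 && memchr(s, '\0', len) != NULL;
}
```

With the three bytes "a\0b" and len 3, utf8_valid_len() accepts the buffer
just as it accepts "a\x01" "b", while strlen() sees only one byte.  A caller
that must hand the text to a nul-terminated API can check has_embedded_nul()
at that boundary instead of having the validator reject the input.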

behdad


