Re: g_utf8_validate() and NUL characters



Freddie Unpenstein wrote:

> My assertion here is basically this; ASCII text (defined here as
> characters 1-127) encode into UTF-8 as-is. Anything else in the 0-255
> set is considered binary, and should be encoded in its shortest
> multi-byte UTF-8 form. No more, and no less. Call it Glib encoding.

So, go do it and try to get it into GLib if you wish.  I don't see what that
has to do with g_utf8_* though, since it's apparently not UTF-8, nor called that.

> I believe, that differs from the UTF-8 specification ONLY in the
> handling of the NULL byte, but then I've been avoiding dealing with
> UTF-8 for the most part for exactly this reason. When UTF-8 is a strict
> issue, I've been using higher-level scripted languages instead, that
> already deal with it natively. (And I'm not 100% certain, but I think
> that's essentially what they all do.)

False.  XML doesn't do such nasty stuff.  HTTP doesn't either.  HTML doesn't
either.  *Only* Java does.  There's a reason standards are good, and there's
a reason people use standards.

> A "convert to UTF-8" function given a UTF-8 input with a 6-byte
> representation of the character 'A' would store the regular single-byte
> representation.

False.  It errors out on an overlong representation of 'A'.  If it doesn't, that's a bug.

> I know it's a bit of a mind-bend from where Glib/GTK is right now with
> encodings, Glib/GTK developers don't like hearing from us lowly humans,
> and there's always resistance to change, but specifications often change
> when needed to meet practical requirements (no one has ever written a
> 100% perfect specification), and personally, changing the platform and
> established behaviour (much harder and more dangerous to attempt to do)
> to suit the UTF-8 specification in this rather trivial issue seems far
> more wrong than breaking the UTF-8 specification slightly for internal
> use only. (The key being the "for internal use only", all "convert to
> UTF-8" functions would still produce the strict interpretation with
> \0's) It seems furthermore to be more correct in this day and age to
> bend a rule like this that makes it SAFER by allowing the old
> NULL-terminated string handling to function, and not force programmers
> to deal specially with length specifiers, which happens to all too
> frequently be a great source of coding mistakes. This would also make it
> easier to migrate, for example, to UTF-16 at some point in time -
> everything will already be converting between UTF-8 to Glib-8, so
> transitioning to Glib-16 would be an entirely internal affair.

You're totally missing the point.  Allowing an alternate NUL representation
opens up security problems that are much harder to track down than a crashing
application.  There's a lot of literature about this already.  Not going to
continue it here.

behdad


> Fredderic

