String encoding within Glib/GTK (Was: g_utf8_validate() and NUL characters)

From: "Freddie Unpenstein" <fredderic excite com>
To: behdad behdad org
Cc: gtk-devel-list gnome org
Subject: String encoding within Glib/GTK (Was: g_utf8_validate() and NUL characters)
Date: Sun, 12 Oct 2008 02:20:55 -0400

Freddie Unpenstein wrote:

Okay, to clarify one point. I was speaking more of NUL handling in general within Glib/GTK, rather than ONLY the one function named in the subject line. This thread of the discussion should have been fork()ed a few messages before I jumped in, I suppose.

From: "Behdad Esfahbod" , 12/10/2008 07:22

>> I believe, that differs from the UTF-8 specification ONLY in the
>> handling of the NULL byte, but then I've been avoiding dealing with
>> UTF-8 for the most part for exactly this reason. When UTF-8 is a strict
>> issue, I've been using higher-level scripted languages instead, that
>> already deal with it natively. (And I'm not 100% certain, but I think
>> that's essentially what they all do.)
> False. XML doesn't do such nasty stuff. HTTP doesn't either. HTML doesn't
> either. *Only* Java does. There's a reason standards are good, and there's
> a reason people use standards.

XML isn't processing the text, iterating over the text, etc. Neither is HTTP. It is appropriate for them to employ the UTF-8 standard, in all its absolutely rock-solid glory. This isn't XML or HTTP I'm talking about, this is writing an application in the C programming language that may well be processing such. I'm not sure what your point was there; it seems off-topic to me.

You're wrong on the ONLY part, also. Java isn't the *Only* higher-level language around. And regardless, how a HLL wishes to store its strings is of no concern, as long as it does the right thing at the borders. The same goes for Glib/GTK. And THAT is the point of this portion of my argument. It's perfectly valid to have a not-quite-UTF-8 internal string format, as long as it's kept internal. If you're producing output (not only in XML or HTML), then by all means do the _to_utf8 conversion. It should be REQUIRED anyhow, in this day and age!
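To make the "convert at the borders" idea concrete, here is a rough sketch in plain C. It assumes the internal form differs from strict UTF-8 only in storing U+0000 as the two bytes C0 80 (the Java-style variant under discussion), and that the input has already been validated as strict UTF-8. The function names internal_from_utf8() and utf8_from_internal() are made up for illustration; they are not Glib API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical inbound border: copy `len` bytes of already-validated strict
 * UTF-8 into `out`, storing each 0x00 byte as the pair C0 80 so the result
 * can live in an ordinary NUL-terminated C string.
 * `out` must have room for 2 * len + 1 bytes.
 * Returns the length of the internal string, excluding the terminator. */
size_t internal_from_utf8(const unsigned char *in, size_t len,
                          unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        if (in[i] == 0x00) {
            out[n++] = 0xC0;
            out[n++] = 0x80;
        } else {
            out[n++] = in[i];
        }
    }
    out[n] = '\0';
    return n;
}

/* Hypothetical outbound border: turn the internal NUL-terminated form back
 * into strict UTF-8, restoring real 0x00 bytes.
 * `out` must have room for strlen(in) bytes; the output is NOT terminated,
 * because it may itself contain 0x00. Returns the number of bytes written. */
size_t utf8_from_internal(const unsigned char *in, unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; in[i] != '\0'; ) {
        /* in[i] is non-zero here, so in[i + 1] is at worst the terminator. */
        if (in[i] == 0xC0 && in[i + 1] == 0x80) {
            out[n++] = 0x00;
            i += 2;
        } else {
            out[n++] = in[i++];
        }
    }
    return n;
}
```

Everything between the two calls can then use strlen() and friends without ever seeing a raw 0x00, and nothing outside the program ever sees C0 80.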

You are absolutely right in one regard, though: standards ARE good. They also have their place; they're designed for a perfect world, and they're not perfect. External data SHOULD be standards compliant. Due care IS required, a specification IS needed, and that specification MUST be upheld, for things to work. But internal data does NOT have to be held to the same rigours as external data, and in many cases it is wrong to impose a perfectly correct external data standard on the internal representation of that data; it can quite readily be half a step to the left of the external standard, if doing so makes working with that standard easier and less confusing, especially for the less capable. Just look at network byte ordering. It is a standard, but that doesn't mean that every program uses that byte ordering throughout internally. Well-written programs convert it at the borders, or at the least, guard it closely until they do.

Which is better for the application: RIGHT code that uses a slightly non-UTF-8 internal form, or WRONG code that tries to do the right thing but fails due to the added complexity and/or unexpected gotchas?

>> A "convert to UTF-8" function given a UTF-8 input with a 6-byte
>> representation of the character 'A' would store the regular single-byte
>> representation.
> False. It errs on an overlong representation of 'A'. If it doesn't, it's a bug.

Well, for one thing I was obviously speaking there of conversion rather than validation. For conversion, you MIGHT want to be strict. MOST however, will want to be as tolerant as they can of almost-right data. Better is to have the conversion function flexible, and if you're worried, validate it prior to conversion with something like g_utf8_validate(). An over-long representation of 'A' (I could imagine a sloppy UTF-16 program letting something like that through) sitting in a data file shouldn't break your program, unless there's an external contract stating that it should. If there is, g_utf8_validate() will uphold that contract just fine prior to conversion.
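For illustration only, a tolerant decoder of the kind I mean might look roughly like the sketch below. It follows the original 1-to-6-byte UTF-8 bit layout, hands back the code point, and merely reports whether the sequence was overlong, leaving the strict-or-lenient decision to the caller. The name decode_tolerant() is invented for this example and is not a Glib function; the sketch also makes no attempt to reject surrogates or values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tolerant decoder: read one sequence from `s` (at most `len`
 * bytes). Returns the number of bytes consumed, or 0 if the sequence is
 * malformed or truncated. On success, *cp receives the code point and
 * *overlong is set to 1 if a shorter encoding of it exists, else 0. */
size_t decode_tolerant(const unsigned char *s, size_t len,
                       uint32_t *cp, int *overlong)
{
    /* Smallest code point that genuinely needs N bytes, indexed by N. */
    static const uint32_t min_cp[7] = {
        0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
    };
    size_t need;
    uint32_t c;

    assert(s != NULL && cp != NULL && overlong != NULL);
    if (len == 0)
        return 0;

    /* Work out the sequence length and payload bits from the lead byte. */
    if (s[0] < 0x80)                { need = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; c = s[0] & 0x07; }
    else if ((s[0] & 0xFC) == 0xF8) { need = 5; c = s[0] & 0x03; }
    else if ((s[0] & 0xFE) == 0xFC) { need = 6; c = s[0] & 0x01; }
    else return 0;                  /* stray continuation byte, FE or FF */

    if (need > len)
        return 0;

    /* Each continuation byte must look like 10xxxxxx and adds six bits. */
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *cp = c;
    *overlong = (c < min_cp[need]);
    return need;
}
```

A lenient converter would write out *cp in its shortest form regardless of the flag; a strict one would treat a set flag as an error, which is the contract g_utf8_validate() would otherwise enforce up front.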

> You're totally missing the point. Allowing an alternate nul representation
> opens security problems much harder to track down than a crashing application.
> There's a lot of literature to read about this already. Not going to
> continue it here.

I've read a fair bit of such literature myself. Probably not quite so much as you, so I'm more than happy to be enlightened. But real \0 NULs also introduce security problems that don't always crash a program. If one small function in a program decides to treat a NUL as a special character, even by mistake, then you've got a hard-to-find problem. Allowing alternate representations is the problem; demanding exactly one specific alternate representation is not.
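As a small illustration of the quiet failure I have in mind: any routine that falls back on C string functions will stop at the first real 0x00 and silently ignore the rest of the buffer. The helper below is a made-up name, not Glib API; it simply checks whether a length-counted buffer contains a byte that strlen() and friends would mistake for the end.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical check: returns 1 if the `len`-byte buffer contains a raw 0x00
 * that NUL-terminated string functions would treat as the end of the data,
 * and 0 otherwise. */
int has_embedded_nul(const char *buf, size_t len)
{
    assert(buf != NULL);
    return memchr(buf, '\0', len) != NULL;
}
```

For the five bytes 'a', 'b', 0x00, 'c', 'd' this returns 1, while strlen() on the same buffer reports 2: nothing crashes, but the last two bytes are never looked at by any code that trusts the terminator.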

I'll even repeat that part a little differently: the critical point that every single piece of literature I've read makes is to have ONE definition of each character on the inside, and to properly control the entry and exit of that data. It also does NOT try to dictate WHAT that definition should be, only that it SHOULD avoid being something that has special magical meanings in unexpected cases. NUL qualifies as magical.

If my interpretation is wrong, please DO continue. Because I've heard a few others who appear to have the same interpretation of that literature as I do, if not the same conclusion about how it should be handled.

Fredderic
