Re: g_utf8_validate() and NUL characters



From: "Nikolai Weibull", 09/10/2008 02:01:
> On Wed, Oct 8, 2008 at 13:20, Havoc Pennington <hp pobox com> wrote:
>> Another way to put it, I don't think nul bytes are a user-explainable
>> concept. If anybody who isn't a programmer sees (how? what's the
>> glyph?) a nul byte in a _text_ file, that's just bizarre.
> How is "oh, you can't open /that/ file in a text editor because it has
> a character in it that isn't a user-explainable concept" (I'm not
> trying to make a straw man argument) better than simply opening the
> file, displaying the NUL as a box with 0000 in it (like Pango does for
> other characters it can't render) and be done with it? I don't see
> how it's the program's responsibility to state what can and what cannot
> be in a file the user wants to open, as long as the file is valid in
> the chosen encoding.

Why not just adopt the old practice of encoding NULs and other non-UTF-8 bytes as safe UTF-8 equivalents? I've seen representing \0 as the two-byte sequence 0xC0 0x80 (or however it's specified) recommended in a secure programming document as a way of avoiding accidents (especially when you're using someone else's libraries), and plenty of other software and toolkits do it. C's use of NUL as a terminator is an implementation detail of C; it shouldn't be inflicted on everything else.

There's no need for every API function taking a text string (as opposed to GLib functions that may well be storing binary strings) to also have a version that takes a length, or for every string value throughout GTK to carry around a length, with all the extra work of handling length/buffer pairs instead of simple NUL-terminated strings. Especially when most of them don't handle binary data anyhow.

Still doesn't answer the rendering issue, but personally, I think a NUL shouldn't have any special meaning in a string to be displayed. Whether it gets rendered as a box with 0's, or a zero-width solid space, or whatever else, is another issue entirely. But it shouldn't require extra effort to handle it... Simply label it a binary character, and encode it in the binary-to-UTF-8 functions. It can then be displayed however someone else decides, and be converted back into the original NUL by a UTF-8-to-binary function later on.
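
To make the round trip concrete, here is a minimal sketch of such a pair of functions in plain C. The names nul_escape() and nul_unescape() are hypothetical, not GLib API, and the sketch only handles the NUL case: every 0x00 byte is written as 0xC0 0x80 on the way in and restored on the way out. Note that 0xC0 0x80 is an overlong form, so the escaped string is "modified UTF-8" rather than strict UTF-8, and a strict validator should be expected to reject it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not GLib API): copy `len` bytes from `in` to `out`,
 * writing each 0x00 byte as the two-byte sequence 0xC0 0x80.
 * `out` must have room for 2 * len + 1 bytes.
 * Returns the escaped length, not counting the terminating NUL. */
size_t nul_escape(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        if (in[i] == 0x00) {
            out[o++] = 0xC0;
            out[o++] = 0x80;
        } else {
            out[o++] = in[i];
        }
    }
    out[o] = '\0'; /* the result is an ordinary NUL-terminated string */
    return o;
}

/* Hypothetical inverse: read the NUL-terminated string `in` and turn each
 * 0xC0 0x80 pair back into a single 0x00 byte.
 * `out` must have room for strlen(in) bytes.
 * Returns the number of bytes written; the output may contain NULs, so the
 * caller has to use the returned length rather than strlen(). */
size_t nul_unescape(const unsigned char *in, unsigned char *out)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; in[i] != '\0'; i++) {
        /* in[i] is non-zero here, so reading in[i + 1] is at worst the
         * terminator and never past the end of the string. */
        if (in[i] == 0xC0 && in[i + 1] == 0x80) {
            out[o++] = 0x00;
            i++; /* skip the second byte of the pair */
        } else {
            out[o++] = in[i];
        }
    }
    return o;
}
```

Everything between the two calls can treat the data as a simple NUL-terminated string; only the code at the binary boundary needs to carry a length around.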


Fredderic