Re: g_utf8_validate() and NUL characters



On Tue, 07 Oct 2008 16:55:29 -0400, Behdad Esfahbod wrote:

> coda wrote:
>> I discussed this on #gtk+ with mathrick and pbor and it seems that the
>> assumption that UTF-8 strings are NUL-terminated and contain no NULs
>> runs pretty deep. A possible solution is to use "modified UTF-8" (
>> http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 ), which encodes
>> U+0000 as the two-byte sequence 0xC0 0x80: an overlong form that is
>> illegal in standard UTF-8 but decodes to U+0000 under the normal
>> decoding algorithm.
> 
> I believe the NUL bytes are the smallest of your problems, and can be
> fixed in GTK+.

I think it's not so small a problem; as you note below, the assumptions 
about what a string is and what the length parameter means are widely 
inconsistent and incompatible (not to mention the large parts of the API 
that don't even pass or accept a length; those need fixing all the 
more). It looks like a fair chunk of crap-diving work to do it properly, 
and then a bunch of APIs would need deprecating. Not rocket science, but 
tedious and with backcompat implications.
 
> And there's g_utf8_validate() that interprets it as:
> 
>   - If -1, str is nul-terminated.  Otherwise, length is the length in
>     bytes of str.  str should not be nul-terminated in the first
>     length bytes.
> 
> Ugh.  Why is that?  Who knows?  Matthias suggested that because a string
> claiming to be length bytes long but terminating prematurely is not
> valid.  However, that statement assumes the string is nul-terminated.

That is indeed very ugly, and seems to me like a particularly unfortunate 
case of implementation guts spilling out onto the public API and then 
getting documented as invariants :(. Personally I can't think of a single 
reason not to allow NULs if that was the consistent design decision 
across the stack; \000 is no more invalid than \001 or \007 are, and we 
already handle these. Moreover, I consider not handling NULs a seriously 
ugly bug that needs fixing for a number of user-visible reasons; the 
original gedit bug coda mentions is one of the more duped ones, and 
probably the one that hits me the most. The inability to look at the 
bytes to see why exactly gedit claims the encoding is invalid is 
infinitely frustrating in a catch-22 way.

> So yeah, it's all a mess.  I'd like to somehow clean it up, but it may
> have to wait till glib 3.0...  I don't think the implications of the
> changes will be very catastrophic, but can't know without extensively
> going over all uses in all projects...  Some Google Code voodoo may help
> us get a rough feeling of the odds.

We can start the deprecation work in 2.x, tho.

>> but functions
>> that return a gchar* with no length output parameter, like
>> gtk_text_buffer_get_text(), would require replacements.
> 
> Yes.
> 
>> Another possibility mentioned was making more use of GString.
> 
> Not a huge fan.

Why's that? GString is a very odd animal: we have it, it works fine, 
and it's as good a string implementation, and as compatible with char*, 
as is possible within C, yet it seems to be used exactly nowhere. 
What's the reason not to use GStrings if they do just what's needed? Or 
why do we have them if no one wants to use them?

Cheers,
Maciej



