Re: g_utf8_validate() and NUL characters
- From: Maciej Katafiasz <mathrick gmail com>
- To: gtk-devel-list gnome org
- Subject: Re: g_utf8_validate() and NUL characters
- Date: Thu, 9 Oct 2008 00:01:44 +0000 (UTC)
On Tue, 07 Oct 2008 16:55:29 -0400, Behdad Esfahbod wrote:
> coda wrote:
>> I discussed this on #gtk+ with mathrick and pbor and it seems that the
>> assumption that UTF-8 strings are NUL-terminated and contain no NULs
>> runs pretty deep. A possible solution is to use "modified UTF-8" (
>> http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 ) which represents
>> U+0000 as the two-byte sequence 0xC0 0x80, normally illegal in standard
>> UTF-8, but using the normal decoding algorithm, represents U+0000.
>
> I believe the NUL bytes are the smallest of your problems, and can be
> fixed in GTK+.
I think it's not so small a problem; as you note below, the assumptions
about what a string is and what the length parameter means are widely
inconsistent and incompatible (not to mention the large parts of the API
that don't even pass or accept a length; those need fixing all the more).
It looks like a fair chunk of crap-diving work to do it properly, and then
a bunch of APIs would need deprecating. Not rocket science, but tedious
and with backcompat implications.
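
For reference, here is a minimal sketch of the modified UTF-8 trick coda
describes above. Both helper names are made up for illustration and are
not GLib functions. Rewriting at the byte level is safe because a 0x00
byte in valid UTF-8 can only ever be U+0000 (continuation bytes are
always 0x80..0xBF).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a GLib function): copy `len` bytes of UTF-8
 * from `src` to `dst`, rewriting each 0x00 byte as the overlong pair
 * 0xC0 0x80. `dst` must have room for 2 * len + 1 bytes. Returns the
 * number of bytes written, not counting the terminating NUL. */
static size_t
mutf8_escape_nuls (const unsigned char *src, size_t len, unsigned char *dst)
{
  size_t out = 0;

  assert (src != NULL && dst != NULL);

  for (size_t i = 0; i < len; i++)
    {
      if (src[i] == 0x00)
        {
          dst[out++] = 0xC0;
          dst[out++] = 0x80;
        }
      else
        dst[out++] = src[i];
    }

  dst[out] = 0x00;   /* the only NUL left is the terminator */
  return out;
}

/* Decode a two-byte sequence with the ordinary UTF-8 arithmetic,
 * without any overlong check. */
static unsigned int
decode_two_byte (unsigned char b0, unsigned char b1)
{
  return ((unsigned int) (b0 & 0x1F) << 6) | (unsigned int) (b1 & 0x3F);
}
```

For the input bytes 'a', 0x00, 'b' this writes 'a', 0xC0, 0x80, 'b' and
returns 4, and decode_two_byte (0xC0, 0x80) gives 0. The catch is that
the result is no longer standard UTF-8, so a strict validator will
reject it.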
> And there's g_utf8_validate() that interprets it as:
>
> - If -1, str is nul-terminated. Otherwise, length is the length in
>   bytes of str. Str should not be nul-terminated in the first length
>   bytes.
>
> Ugh. Why is that? Who knows? Matthias suggested that because a string
> claiming to be length bytes long but terminating prematurely is not
> valid. However, that statement assumes that string is nul-terminated.
That is indeed very ugly, and seems to me like a particularly unfortunate
case of implementation guts spilling out onto the public API and then
getting documented as invariants :(. Personally I can't think of a single
reason not to allow NULs if that was the consistent design decision
across the stack; \000 is no more invalid than \001 or \007 are, and we
already handle these. Moreover, I consider not handling NULs a seriously
ugly bug that needs fixing for a number of user-visible reasons; the
original gedit bug coda mentions is one of the more duped ones, and
probably the one that hits me the most. The inability to look at the
bytes to see why exactly gedit claims the encoding is invalid is
infinitely frustrating in a catch-22 way.
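
To make the point concrete, here is a rough sketch of a length-based
validator in which NUL handling is just a policy flag. This is not
GLib's implementation; utf8_validate_len and its allow_nul parameter
are hypothetical names. With allow_nul set to false it follows the rule
quoted above; with true it treats \000 like \001 or \007.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch, not GLib code: validate exactly `len` bytes of
 * `s` as standard UTF-8. An embedded 0x00 is rejected only when
 * `allow_nul` is false. */
static bool
utf8_validate_len (const unsigned char *s, size_t len, bool allow_nul)
{
  size_t i = 0;

  assert (s != NULL || len == 0);

  while (i < len)
    {
      unsigned char c = s[i];
      size_t need;            /* continuation bytes expected */
      unsigned int cp, min;   /* code point, smallest value for this length */

      if (c < 0x80)
        {
          if (c == 0x00 && !allow_nul)
            return false;
          i++;
          continue;
        }
      else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
      else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
      else
        return false;   /* stray continuation byte or invalid lead byte */

      if (len - i - 1 < need)
        return false;   /* sequence cut short by the given length */

      for (size_t k = 1; k <= need; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return false;
          cp = (cp << 6) | (unsigned int) (s[i + k] & 0x3F);
        }

      /* reject overlong forms, surrogates and values above U+10FFFF */
      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;

      i += need + 1;
    }

  return true;
}
```

With the three bytes 'a', 0x00, 'b' and len 3, this returns false when
allow_nul is false and true when it is true. The only NUL-specific code
is the one check in the ASCII branch; everything else depends on the
length alone.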
> So yeah, it's all a mess. I like to somehow clean the mess, but it may
> have to wait till glib 3.0... I don't think the implications of the
> changes will be very catastrophic, but can't know without extensively
> going over all uses in all projects... Some Google Code voodoo may help
> us get a rough feeling of the odds.
We can start the deprecation work in 2.x, tho.
>> but functions
>> that return a gchar* with no length output parameter, like
>> gtk_text_buffer_get_text(), would require replacements.
>
> Yes.
>
>> Another possibility mentioned was making more use of GString.
>
> Not a huge fan.
Why's that? GString is a very odd animal: we have it, it works fine, and
it is as good a string implementation and as compatible with char* as is
possible within C, yet it seems to be used exactly nowhere. What's the
reason not to use GStrings if they do just what's needed? Or why do we
have them if no one wants to use them?
Cheers,
Maciej