Re: g_malloc overhead



On Mon, Jan 26, 2009 at 12:57:28PM -0500, Owen Taylor wrote:
> On Mon, 2009-01-26 at 18:30 +0100, Martín Vales wrote:
> > Yes, I only talked about the overhead with UTF-8 outside of glib, only that.
> > Perhaps the only solution is to add more support for UTF-16 in glib, with
> > more methods.
> 
> There's zero point in talking about a "solution" until you have profile
> data indicating that there is a problem.

Indeed. UTF-16 is horribly broken by design, and any attempt made to
migrate in the direction _towards_ it is a flawed one, and should be
avoided.

UTF-8 is backward-compatible with the legacy str*() functions in C,
which, like it or not, will be around for a while yet. 

 * It never puts a NUL byte ('\0') in the stream unless it means it, as
   U+0000, which is what makes it work with these functions.
   
 * UTF-8 has nice properties in substring matches - grep can work on
   UTF-8 despite not knowing it, because the encoding of one character
   never appears inside the encoding of another, nor straddling two
   adjacent ones. A match on the bytes is always a real match.

 * This also means that every '\n' byte (0x0A) in a UTF-8 stream is a
   real linefeed. So cat, head/tail, awk, etc. can properly detect where
   the linefeeds are. 'head' can print, say, the first 3 lines of UTF-8
   text without knowing it's UTF-8.

 * UTF-8 can be sorted by sorting only the encoded bytes. sort (in the C
   locale) can sort a UTF-8-encoded text file: the bytewise order of
   the encoded strings is the same as the codepoint order of the Unicode
   strings they encode.

This list goes on.
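
The four properties above are easy to check at the byte level. What
follows is only a rough sketch in Python, not anything taken from glib
or from the tools named above; the helper names (byte_find_as_char_index,
head, sort_bytewise) and the sample strings are made up for illustration.

```python
# Sketch only: helper names and sample data are hypothetical.

def byte_find_as_char_index(haystack: str, needle: str) -> int:
    """Search on raw UTF-8 bytes, then report the match as a character index."""
    h, n = haystack.encode("utf-8"), needle.encode("utf-8")
    pos = h.find(n)  # plain byte search, knows nothing about UTF-8
    if pos < 0:
        return -1
    # A match always starts on a character boundary, so the prefix decodes.
    return len(h[:pos].decode("utf-8"))

def head(data: bytes, n: int) -> bytes:
    """Return the first n lines, splitting on the 0x0A byte only."""
    lines = data.split(b"\n")
    return b"\n".join(lines[:n]) + b"\n"

def sort_bytewise(strings):
    """Sort strings by comparing their UTF-8 bytes, as a byte-level sort would."""
    return [b.decode("utf-8") for b in sorted(s.encode("utf-8") for s in strings)]

# 1. No NUL byte unless U+0000 itself is encoded.
sample = "na\u00efve caf\u00e9, \u65e5\u672c\u8a9e, \U0001d11e"
assert 0 not in sample.encode("utf-8")
assert "a\u0000b".encode("utf-8") == b"a\x00b"

# 2. A byte-level match is a real match, at the right character.
hay = "\u65e5\u672c\u8a9e caf\u00e9"
assert byte_find_as_char_index(hay, "\u00e9") == hay.index("\u00e9")

# 3. Every 0x0A byte is a real linefeed.
text = "\u03b1\n\u03b2\n\u03b3\n\u03b4\n".encode("utf-8")
assert head(text, 3).decode("utf-8") == "\u03b1\n\u03b2\n\u03b3\n"

# 4. Bytewise order equals codepoint order (Python compares str by codepoint).
words = ["z", "\u00e9", "\u65e5\u672c", "\U0001d11e", "a", "\u00f1"]
assert sort_bytewise(words) == sorted(words)
```

Note that codepoint order is not the same thing as a language's
dictionary order; the sketch only checks that sorting the bytes and
sorting the codepoints agree.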


Meanwhile, on the other end of the spectrum, storing Unicode data as
decoded 32-bit integers makes some sense. Every codepoint has the same
width, so string indexing is a constant-time operation - the substring
from character offset 4 up to character offset 9 in such an array is
known to lie between byte offsets 16 and 36. The presence of combining
characters and double-width glyphs does make this a bit harder, since
one codepoint is not always one displayed character, effectively
reducing the advantage such a scheme has.
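
As a small illustration of the fixed-width arithmetic, here is a sketch
in Python. The helper name utf32_slice is hypothetical, and the data is
assumed to be UTF-32 with an explicit byte order and no BOM.

```python
# Sketch only: utf32_slice is a made-up helper, not a library function.

def utf32_slice(data: bytes, start: int, end: int) -> bytes:
    """Return codepoints [start, end) of BOM-less UTF-32 data by pure arithmetic."""
    return data[4 * start : 4 * end]  # 4 bytes per codepoint, always

text = "abcdefghijkl"
data = text.encode("utf-32-le")  # explicit endianness, so no BOM is prepended

# Character offsets 4..9 map to byte offsets 16..36.
piece = utf32_slice(data, 4, 9)
assert piece == data[16:36]
assert piece.decode("utf-32-le") == "efghi"

# The caveat: a codepoint is not always a displayed character.
# 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as one glyph,
# but occupies two array slots.
decomposed = "e\u0301tude"
assert len(decomposed) == 6  # six codepoints, five displayed characters
assert len(decomposed.encode("utf-32-le")) == 24
```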


Compared to that, UTF-16 offers NONE of these advantages. UTF-16 cannot
be passed through any legacy str*() function, because every ASCII
character is encoded with a zero byte that these functions take as the
end of the string. Nor will it work in grep, sed, awk, cut, sort, head,
tail, or in fact _any_ of the standard UNIX text tools. Nor can UTF-16
be array indexed in constant time, because of the surrogate pairs used
to encode codepoints outside of the BMP (Basic Multilingual Plane).
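
The same kind of byte-level check shows where UTF-16 falls down. Again
this is just a sketch in Python; c_strlen is a hypothetical stand-in
that mimics what C's strlen() would see, not a call into libc.

```python
import struct

# Sketch only: c_strlen imitates strlen() by stopping at the first NUL byte.
def c_strlen(data: bytes) -> int:
    end = data.find(b"\x00")
    return len(data) if end == -1 else end

# 1. Embedded zero bytes: a str*() function sees "hello" as one byte long.
assert "hello".encode("utf-16-le") == b"h\x00e\x00l\x00l\x00o\x00"
assert c_strlen("hello".encode("utf-16-le")) == 1
assert c_strlen("hello".encode("utf-8")) == 5

# 2. Surrogate pairs: code unit index is not character index.
s = "a\U0001d11eb"  # 'a', U+1D11E MUSICAL SYMBOL G CLEF, 'b'
units = struct.unpack("<4H", s.encode("utf-16-le"))
assert len(s) == 3 and len(units) == 4
assert units[1] == 0xD834 and units[2] == 0xDD1E  # high and low surrogate
assert units[2] != ord("b")  # slot 2 is half a character, not 'b'

# 3. Sorting the encoded form does not give codepoint order.
words = ["\uff5e", "\U0001d11e"]  # U+FF5E sorts before U+1D11E by codepoint
by_utf16 = sorted(words, key=lambda w: w.encode("utf-16-be"))
assert sorted(words) == ["\uff5e", "\U0001d11e"]
assert by_utf16 == ["\U0001d11e", "\uff5e"]  # 0xD834... sorts before 0xFF5E
```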


In Summary - UTF-16. Don't. Just Don't.

-- 
Paul "LeoNerd" Evans

leonerd leonerd org uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/



