Re: UTF-8: Case mapping

From: Pablo Saratxaga <pablo mandrakesoft com>
To: gtk-i18n-list gnome org
Subject: Re: UTF-8: Case mapping
Date: Thu, 28 Jun 2001 16:10:18 +0200
Kaixo!

On Thu, Jun 28, 2001 at 10:21:13PM +1000, Raymond Wan wrote:
 
> > I wonder... the GNU libc has very complex and comprehensive per language
> > (per locale even) sorting rules; and, from the source files at least,
> 
> 	Just wondering as I don't know the GNU libc very well, but would
> libc be able to handle variable length bytes for a given character?

Maybe calling mbtowc() would be needed, but I'm not even sure of that:
multi-character collation elements already work (e.g. traditional Spanish
sorting: a, b, c, ch, d, ...), so multibyte should work too.
That being said, I have never tested either.
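
For reference, a minimal sketch of what stepping through a multibyte string
with mbtowc() looks like. The helper name mb_char_count is hypothetical, and
the result depends on the LC_CTYPE locale in effect (the caller is assumed to
have called setlocale() beforehand):

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Count characters (not bytes) in s according to the current LC_CTYPE
 * locale. Returns -1 if s contains an invalid multibyte sequence. */
long mb_char_count(const char *s)
{
	size_t left = strlen(s);
	long count = 0;
	wchar_t wc;

	mbtowc(NULL, NULL, 0);	/* reset the internal conversion state */
	while (left > 0) {
		int n = mbtowc(&wc, s, left);
		if (n < 0)
			return -1;	/* invalid or incomplete sequence */
		if (n == 0)
			break;		/* reached a NUL character */
		s += n;
		left -= (size_t)n;
		count++;
	}
	return count;
}
```

In a UTF-8 or EUC-JP locale each loop iteration may consume several bytes;
for plain ASCII input it consumes exactly one.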

> example, I presume Japanese and Chinese C libraries would sort assuming
> the input was two byte strings (S-JIS for Japanese and Big5 for Chinese)

On GNU/Linux at least, Shift-JIS is, thank God, never used. EUC-JP is,
and it is multibyte.

The GNU libc is supposed to handle multibyte correctly; there may still be
bugs or things not completely implemented, but if so, that is a temporary
situation: the goal is clearly to handle all those situations correctly.
IMHO the problem would more likely be with other libc implementations.

> 	I partly agree with your message, but for the most part, I also
> don't understand the problem at hand very well.

That is why I asked for a better explanation of it first :)

> As UTF-8 is not used fully yet

Technically, UTF-8 and EUC-JP pose the same kind of problems.
The main difference is that UTF-8 is a comprehensive encoding, able to
encompass all other encodings in use (standard or not, widely used or not),
so it is a convenient candidate for internal use when information must
be preserved (conversion to the locale encoding can lose some info). UCS-2
or UCS-4 are also good candidates. The advantage of UTF-8 is that it is
byte compatible and ASCII compatible; that is, a string in UTF-8 can be
displayed without any major change to programs: printf() will
work fine, for example (well, as long as you restrict yourself to simple
scripts, LTR and non-combining).
So UTF-8 is the preferred encoding for manipulating text data in future
Gtk/Pango/GNOME. And, in order to keep a lossless and comprehensive
internal encoding while still letting the user use a different locale
encoding, several functions are doubled: one for the locale-dependent
encoding and another for UTF-8.
But from a technical point of view the handling of UTF-8 and EUC is similar
(UTF-8 is a sort of "extended EUC"; it follows the same principles).
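
To illustrate the byte-compatibility point, here is a small sketch (the
helper name utf8_codepoints is made up for this example, and it assumes the
input is already valid UTF-8). Byte-oriented functions such as strlen() and
printf() keep working on a UTF-8 string; only counting characters needs to
skip continuation bytes:

```c
#include <stddef.h>
#include <string.h>

/* Count code points in a valid UTF-8 string: every byte that is not a
 * continuation byte (bit pattern 10xxxxxx) starts a new code point. */
size_t utf8_codepoints(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

For the string "\xC3\xA7" "a" (c-cedilla followed by 'a'), strlen() reports
3 bytes while utf8_codepoints() reports 2 characters; passing the same
string to printf("%s") simply writes the bytes through unchanged.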

> and will be used more often in the next few years, it's hard to
> predict what a typical user's needs will be.

Well, not exactly.
The needs will be the same as now with locale encodings, plus the need
to convert between the locale encoding and UTF-8 (that is already covered
by nice functions in glib, btw).
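
As a rough sketch of that conversion step, here is a version built on POSIX
iconv() rather than the glib functions, to keep it dependency-free. The
helper name to_utf8 and the fixed-size output buffer are assumptions made
for this example only:

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert the NUL-terminated string `in` from encoding `from` to UTF-8,
 * writing at most outsz bytes (including the final NUL) into out.
 * Returns 0 on success, -1 on failure. */
int to_utf8(const char *from, const char *in, char *out, size_t outsz)
{
	iconv_t cd;
	char *src = (char *)in;	/* iconv() takes char **, input is not modified */
	char *dst = out;
	size_t inleft = strlen(in);
	size_t outleft;
	size_t rc;

	if (outsz == 0)
		return -1;
	outleft = outsz - 1;	/* keep room for the terminator */

	cd = iconv_open("UTF-8", from);
	if (cd == (iconv_t)-1)
		return -1;	/* unknown encoding name */
	rc = iconv(cd, &src, &inleft, &dst, &outleft);
	iconv_close(cd);
	if (rc == (size_t)-1)
		return -1;	/* invalid input or buffer too small */
	*dst = '\0';
	return 0;
}
```

Going the other way (UTF-8 to the locale encoding) is where information can
be lost, since the target encoding may not contain every character.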
As UTF-8 is used internally, there is also the need to accept any kind of
keyboard input, regardless of locale (that is possible in recent XFree86;
look at xterm for example), the ability to do UTF-8 cut and paste (ditto),
the ability to display it (possible with Pango), etc.
The ability to display also means there is a need to sort strings (for
lists, for example). So there is a need to sort UTF-8 strings.

Then the discussion became somewhat confusing (at least for me).

> 	I think having some basic sorting for 2.0 (i.e., primitives for
> users to build on) and waiting until UTF-8 catches on to see what is
> popular sounds like one valid idea to me... 

No, it isn't.
If utf-8 is used internally, the sorting must be done in utf-8.

What I don't understand is what the problem is.
Why can't the existing strcoll() be used for that? E.g. something like:

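#include <locale.h>
#include <string.h>
#include <glib.h>

int utf8_strcoll(const char *s1, const char *s2)
{
	char *saved, *lang, *tmp;
	int val;

	/* setlocale() may reuse its return buffer, so keep a private copy */
	saved = g_strdup(setlocale(LC_COLLATE, ""));
	/* keep the "ll_CC" part of the name and force the UTF-8 codeset */
	lang = g_strndup(saved, 5);
	tmp = g_strconcat(lang, ".UTF-8", NULL);
	setlocale(LC_COLLATE, tmp);

	val = strcoll(s1, s2);

	/* restore the user's collation locale */
	setlocale(LC_COLLATE, saved);
	g_free(tmp);
	g_free(lang);
	g_free(saved);
	return val;
}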



> 
> Ray
> 
> 

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/		PGP Key available, key ID: 0x8F0E4975