Re: Multibyte improvement - g_unichar_to_utf8()

From: Yukihiro Nakai <ynakai redhat com>
To: Jody Goldberg <jgoldberg home com>
Cc: gnumeric-list gnome org, havill redhat com
Subject: Re: Multibyte improvement - g_unichar_to_utf8()
Date: Wed, 3 Jan 2001 20:10:25 +0900

On Mon, 1 Jan 2001 17:09:50 -0500
Jody Goldberg <jgoldberg home com> wrote:

On Sun, Dec 31, 2000 at 07:06:21AM +0900, Yukihiro Nakai wrote:


The g_unichar_to_utf8() function in print-cell.c is broken in multibyte
environments because it doesn't do any multibyte handling.

I wrote a new g_unichar_to_utf8() function, using gint32, that handles
multibyte characters correctly in multibyte environments.
In EUC-JP, 2 bytes is normal, but some other codesets use 4 bytes.


The comment in src/print-cell.c says
    'This is cut & pasted from glib 1.3'

If your replacement is better it should go into glib-1.3 and
gnumeric.  I'll wait for someone more experienced in these details
to make this decision.


http://cvs.gnome.org/bonsai/cvsblame.cgi?file=glib/gutf8.c&rev=&root=/cvs/gnome

The g_unichar_to_utf8() function in glib 1.3 seems to convert an ISO 10646
character to a UTF-8 character (the first arg is gunichar, == guint32). But
g_unichar_to_utf8() in gnumeric is used to convert locale-dependent characters
to UTF-8.

So it causes no error in glib, but it does in gnumeric.
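
To make the difference concrete, here is a minimal sketch of an ISO 10646 to
UTF-8 encoder. It is not the glib source; the name ucs4_to_utf8() is
hypothetical and only for illustration. It assumes the input really is a
Unicode code point, which is exactly the assumption that breaks when
locale-dependent bytes are passed in instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one ISO 10646 (UCS-4) code point as UTF-8.
 * Writes up to 4 bytes into out and returns the number of bytes written,
 * or 0 if cp is not a valid Unicode scalar value.
 * Hypothetical helper for illustration; not the glib implementation. */
size_t ucs4_to_utf8(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80) {                       /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

Given the code point U+FF21 (fullwidth 'A'), this yields 0xef 0xbc 0xa1.
Given the EUC-JP bytes packed into one integer (0xa3c1), it treats the value
as the code point U+A3C1 and yields 0xea 0x8f 0x81, which is an unrelated
character.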

I used #ifdef linux macro because in *BSDs don't have

I'd prefer to see #ifdef HAVE_LANGINFO_H than #ifdef linux
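
A feature-test guard of that kind might look like the sketch below. It
assumes the guarded call is nl_langinfo(CODESET) from <langinfo.h> and that
HAVE_LANGINFO_H is defined by configure; the helper name locale_codeset() is
hypothetical.

```c
#include <stddef.h>

#ifdef HAVE_LANGINFO_H
#include <langinfo.h>
#endif

/* Return the codeset name of the current locale, or NULL when
 * <langinfo.h> was not detected at configure time.
 * Hypothetical helper; HAVE_LANGINFO_H is assumed to come from config.h. */
const char *locale_codeset(void)
{
#ifdef HAVE_LANGINFO_H
    return nl_langinfo(CODESET);
#else
    return NULL;
#endif
}
```

Testing for the header rather than the operating system lets any platform
that ships <langinfo.h> take the same code path.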

In this example, I use 'ABC' in EUC-JP multibyte and 'ABC' in ASCII.
Below are the char codes for reference:

   |  ASCII(UTF-8)       EUC-JP   UTF-8(multibyte)
---+----------------------------------------------
A  |          0x41    0xa3 0xc1    0xef 0xbc 0xa1
B  |          0x42    0xa3 0xc1    0xef 0xbc 0xa2
C  |          0x43    0xa3 0xc1    0xef 0xbc 0xa3


This confuses me.
1) It seems as if A == B == C in the EUC-JP case.
2) where can I find some documentation on UTF-8 vs UTF-8(multibyte) ?


Oops. It's a mistake.
1)
    |  ASCII(UTF-8)       EUC-JP   UTF-8(multibyte)
 ---+----------------------------------------------
 A  |          0x41    0xa3 0xc1    0xef 0xbc 0xa1
 B  |          0x42    0xa3 0xc2    0xef 0xbc 0xa2
 C  |          0x43    0xa3 0xc3    0xef 0xbc 0xa3

2)
                        ASCII                 UTF-8
    Single byte A :      0x41    <->           0x41


    EUC-JP 'A'    : 0xa3 0xc1    <-> 0xef 0xbc 0xa1

The EUC-JP 'A' is the double-width character 'A', encoded the same way as a
Japanese character. See the sample mbstr.png: the first 'ABC'
is in ASCII, and the last 'ABC' is in EUC-JP.

The ASCII 'A' has the same char code in both ASCII and UTF-8.
But the EUC-JP 'A' is 2 bytes in EUC-JP and 3 bytes in UTF-8.
You can check with the iconv command which UTF-8 code a given
EUC-JP code is converted to.
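
The same check can be done from C with iconv(3). This is only a sketch: the
helper name eucjp_to_utf8() is hypothetical, and it assumes the system's iconv
knows the "EUC-JP" and "UTF-8" codeset names (it returns -1 if not).

```c
#include <iconv.h>
#include <stddef.h>

/* Convert an EUC-JP byte sequence to UTF-8 using iconv(3).
 * Returns the number of UTF-8 bytes written to out, or -1 on failure
 * (unsupported codeset, invalid input, or output buffer too small).
 * Hypothetical helper for illustration. */
long eucjp_to_utf8(const char *in, size_t in_len, char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-8", "EUC-JP");   /* (to, from) */
    if (cd == (iconv_t)-1)
        return -1;

    char *src = (char *)in;     /* iconv() takes char **; input is not modified */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;
    return (long)(out_cap - dst_left);
}
```

Where EUC-JP is supported, passing the two bytes 0xa3 0xc1 should give back
the three bytes 0xef 0xbc 0xa1 from the table above.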

---
Yukihiro Nakai, Red Hat Japan, Development

Attachment: mbstr.png
Description: PNG image

