Re: Glib::ustring::iterator not iterating over chinese character

From: Daniel Elstner <daniel kitta googlemail com>
To: Weimin Xie <panyuweimin hotmail com>
Cc: gtkmm-list gnome org
Subject: Re: Glib::ustring::iterator not iterating over chinese character
Date: Fri, 01 May 2009 02:06:47 +0200

On Monday, 13.04.2009, at 18:47 +0800, Weimin Xie wrote:

> I'm learning how to use Glib::ustring. My goal is to split a ustring
> of Unicode characters into a vector container. In a simple case, my
> program has read a Chinese character, for example, "你". When I tried
> to use the Glib::ustring::iterator to go over the ustring, it shows
> that there is more than one entry.
> 
> If description = "你", then
[...]
> Gives me 
> size <5> bytes <8> char <228> char <189> char <160> char <10> char
> <10>

To me, this looks suspiciously like something that would happen if a
string gets encoded twice.  That is, I suspect you already had a UTF-8
encoded string, which subsequently got interpreted as a string of
ISO-8859-1 bytes and then translated a second time to UTF-8.
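
To make the suspected double encoding concrete, here is a minimal sketch
in plain standard C++ (no glibmm).  The helper names latin1_to_utf8 and
append_utf8_small are made up for illustration; they are not part of
glibmm or of your program, and I am only guessing that something
equivalent happens somewhere in your input path.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of one code point below U+0800 to `out`.
// That range is all this sketch needs, since ISO-8859-1 only covers
// U+0000 through U+00FF.
static void append_utf8_small(std::string& out, std::uint32_t cp)
{
    assert(cp < 0x800);
    if (cp < 0x80) {
        out += static_cast<char>(cp);                 // one byte, ASCII
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));   // lead byte
        out += static_cast<char>(0x80 | (cp & 0x3F)); // continuation byte
    }
}

// Hypothetical helper: treat every byte of `bytes` as one ISO-8859-1
// character and encode that character as UTF-8.  Feeding it a string
// that is already UTF-8 produces a doubly encoded string.
std::string latin1_to_utf8(const std::string& bytes)
{
    std::string out;
    for (unsigned char b : bytes)
        append_utf8_small(out, b);
    return out;
}
```

Passing the correctly encoded input "\xE4\xBD\xA0\n\n" (5 bytes) through
latin1_to_utf8() yields the 8 bytes C3 A4 C2 BD C2 A0 0A 0A, which
matches the byte count you reported.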

With just one code point (你, U+4F60) plus the two trailing newline
characters, the output for size should have been 3 instead of 5, and the
number of bytes should have been 5 rather than 8: the character takes
three bytes in UTF-8 (228, 189, 160) and each newline takes one.  The
interpretation of a UTF-8 string as ISO-8859-1 would also explain why
you see exactly the numbers you would see if you were iterating over the
bytes of the correctly encoded original string.  That's because up to
code point 255, Unicode is identical to ISO-8859-1.
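
The counts can be checked with a small decoder.  This is again a sketch
in plain standard C++ rather than glibmm code: decode_utf8 is a made-up
name, it assumes well-formed input and does no real validation.  The
number of decoded code points plays the role of ustring::size(), and
std::string::size() plays the role of ustring::bytes().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode a UTF-8 string into a vector of code points.
// Assumption: the input is well-formed UTF-8.  Only truncation is
// checked (via assert); overlong forms and stray bytes are not.
std::vector<std::uint32_t> decode_utf8(const std::string& s)
{
    std::vector<std::uint32_t> cps;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::uint32_t cp;
        std::size_t len;
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        assert(i + len <= s.size());  // sequence must not be cut off
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        cps.push_back(cp);
        i += len;
    }
    return cps;
}
```

For the correct string "\xE4\xBD\xA0\n\n" this gives 3 code points
(0x4F60, 10, 10) in 5 bytes.  For the doubly encoded string
"\xC3\xA4\xC2\xBD\xC2\xA0\n\n" it gives 5 code points (228, 189, 160,
10, 10) in 8 bytes, which is exactly the output quoted above.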

> Can someone please explain why the iterator doesn't go over the
> unicode characters as expected?

It probably does.  It's just that your string doesn't contain what you
think it does.

> Thanks a lot in advance!

You're welcome.  If you still think it's a problem of glibmm, please
file a bug and attach a test case, so we can reproduce the problem.

--Daniel
