Re: [gtk-i18n-list] Unicode PUA supporting issue in gtk+/pango


On Tue, 20 Dec 2005 14:26:08 +0800
Chia-I Wu <b90201047 ntu edu tw> wrote:
>> My anxiety is that: if we write a documentation including PUA
>> charcode today, and read it after the official inclusion of the
>> characters... we cannot search a string without the extra mapping
>> table of PUA code and Unicode codepoint. And, we need a switch

>I think it can be a feature of a software for the HK users.  It's like,
>for example, I have a document with mixed traditional and simplified
>Chinese.  When I search for the character U+967D ("Sun" in traditional
>Chinese) , I would like to see that U+9633 ("Sun" in simplified Chinese)
>is also searched.  The mappings are too complicated and too specific to
>be included in a base library.  (Or maybe not?)

Thank you for giving a concrete example. Let me
ask about the case of Hanzi migrating from the PUA
to officially assigned Unicode codepoints.

CJK people know that U+967D and U+9633 differ only
in their presentation forms and that the meaning is
the same. But Unicode defines them as different
characters, so the base layer of Unicode text
handling should treat them as different Hanzi.
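
To illustrate the point, here is a minimal Python sketch. It only shows
that the two codepoints are distinct at the base layer and that any
equivalence has to come from an application-level table. The names
VARIANT_FOLD and fold_variants are hypothetical, and the table holds
just the one pair from the quoted example; it is not a real mapping
and not part of any gtk+/pango API.

```python
import unicodedata

traditional = "\u967d"  # "Sun" in traditional Chinese
simplified = "\u9633"   # "Sun" in simplified Chinese

# Distinct codepoints: a plain string comparison treats them as different.
same_codepoint = traditional == simplified

# Unicode normalization does not unify them either.
nfkc_equal = (unicodedata.normalize("NFKC", traditional)
              == unicodedata.normalize("NFKC", simplified))

# Hypothetical application-level folding table (one illustrative entry).
VARIANT_FOLD = {ord(simplified): traditional}

def fold_variants(text):
    """Fold known variants to one form before comparing or searching."""
    return text.translate(VARIANT_FOLD)

folded_equal = fold_variants(simplified) == fold_variants(traditional)
print(same_codepoint, nfkc_equal, folded_equal)
```

The folding step lives outside the base text handling, which matches
the suggestion that such mappings belong to the application.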

In the PUA case, however, the relationship between
Hanzi at PUA codepoints and at officially assigned
codepoints (in a revised Unicode, after the new Hanzi
are included) is not clear.

Excuse me, let me explain in some detail. I think it
will help non-CJK people to join the discussion.


GB18030-2000 has several punctuation marks in vertical forms,
and provides a mapping from GB18030-2000 codepoints to
Unicode-3.0 codepoints. Some vertical glyphs were already
included in Unicode's official CJK compatibility forms area
(U+FE30 - U+FE4F), but others were not included yet, so
GB18030-2000 maps them to Unicode-3.0 PUA codepoints.
For example, GB+A6D9 was mapped to U+E78D.

Since Unicode-4.1, the remaining vertical glyphs of GB18030-2000
are included in Unicode's official vertical forms area
(U+FE10 - U+FE1F).
For example, GB+A6D9 is now mapped to U+FE10.
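
The "extra mapping table" mentioned in the quoted text could be
sketched as follows in Python. The only entry is the U+E78D / U+FE10
pair given above; the names PUA_TO_OFFICIAL, migrate_pua and
find_compat are hypothetical, and whether such a migration is correct
at all is exactly the open question.

```python
# Hypothetical migration table. Only the pair named in this message is
# filled in; a real table would need one entry per migrated character.
PUA_TO_OFFICIAL = {0xE78D: 0xFE10}

def migrate_pua(text):
    """Replace known PUA codepoints with their later official codepoints."""
    return text.translate(PUA_TO_OFFICIAL)

def find_compat(haystack, needle):
    """Search after migrating both strings, so that old PUA text and
    new official text match each other. Returns -1 if not found."""
    return migrate_pua(haystack).find(migrate_pua(needle))

old_text = "old \ue78d text"       # written with the PUA codepoint
print(find_compat(old_text, "\ufe10"))
```

Because each entry maps one codepoint to one codepoint, the returned
index is also valid in the unmigrated string.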

Now I have a question. Today, we have Unicode-4.1.0.
Is the character at PUA codepoint U+E78D the same as U+FE10?
Or are they different? Should U+E78D now be treated as
unmapped, or should it be kept for backwards compatibility?
Should the decision be made by font selection only?
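
For reference, this is what the Unicode character database itself
says about the two codepoints, checked here with Python's standard
unicodedata module (assuming a Python whose database is Unicode 4.1
or later). The helper name describe is only for illustration. The
database gives U+E78D no identity of its own, so any relationship to
U+FE10 has to come from an outside agreement such as the GB18030
mapping or the font.

```python
import unicodedata

def describe(ch):
    """Return (codepoint label, general category, character name or None)."""
    return (f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            unicodedata.name(ch, None))

pua = describe("\ue78d")        # category "Co" (private use), no name
official = describe("\ufe10")   # category "Po" (punctuation), has a name
print(pua)
print(official)
```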


