Re: Updating gtkimcontextsimple.c (bug #321896)

From: Simos Xenitellis <simos lists googlemail com>
To: Samuel Thibault <samuel thibault ens-lyon org>
Cc: gtk-i18n-list gnome org
Subject: Re: Updating gtkimcontextsimple.c (bug #321896)
Date: Wed, 20 Feb 2008 16:44:29 +0000

And an article to get things going,

http://blogs.gnome.org/simos/2008/02/20/keyboard-layout-for-combining-diacritics/

Simos

Simos Xenitellis wrote:

Hi Samuel,

I had some discussions on this and I think the problem can be resolved
in the following way.

To add combining diacritics there is no need for extra support in
GTK+; this is something that is handled by the keyboard layouts (which
are not handled by GTK+).
What that means is that you need a keyboard layout that produces all
those combining diacritics.

The project for the keyboard layouts is xkeyboard-config,
http://freedesktop.org/wiki/Software/XKeyboardConfig

For your case of Tagbanwa, you would create a new keyboard layout.
For the generic case to add combining diacritics to different
characters, a catch-all keyboard layout could be used.

Currently, there is no GUI tool to create such keyboard layouts.
On your Linux system, keyboard layouts live in /etc/X11/xkb/symbols/.
You can get an idea of how to modify an existing layout by looking at the files there.

If you would like to pursue this further, I would be happy to give you
instructions.

Simos

On Feb 7, 2008 11:19 AM, Samuel Thibault <samuel thibault ens-lyon org> wrote:

Hello,

Simos wrote:

In bug #341341, Danilo talks about support for compose sequences that
produce more than one Unicode character, as in
COMBINING ACUTE + CYRILLIC A, where no precomposed form exists.
At the moment, the Xorg Compose file does not have such compose
sequences. If we were to implement this in GTK+, I would suggest building
up a new table of the form

dead_acute, A, E, H, I, O, U, ...  (assume all these cyrillic)
dead_diaeresis, A, E, H, I, O, U, ...  (assume all these cyrillic)

The problem is that this is very tedious for people who already have a
hard time making Linux suit their language (fonts, messages, locales,
...) and the table can potentially be very big. For instance, in
Vietnamese you may need to put two accents on a vowel, and so you'd need
to enumerate all such possible combinations.

In check_algorithmic, we currently check if the compose sequence can be
normalised to a single Unicode character.

Which is necessary for proper string uniqueness/comparison etc., yes.
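
As a rough illustration of that check (this is not the GTK+ code; the
function name below is made up for the example, and Python's unicodedata
module stands in for whatever normalisation routine GTK+ uses), one can
test whether a base character plus a combining character collapses to a
single precomposed character under NFC:

```python
import unicodedata

def composes_to_single_char(base: str, combining: str) -> bool:
    """Hypothetical helper: True if base + combining has a precomposed NFC form."""
    return len(unicodedata.normalize("NFC", base + combining)) == 1

# Latin 'a' + COMBINING ACUTE ACCENT has a precomposed form (U+00E1).
latin = composes_to_single_char("a", "\u0301")

# Cyrillic 'а' (U+0430) + COMBINING ACUTE ACCENT has no precomposed form,
# so the normalised result stays two code points long.
cyrillic = composes_to_single_char("\u0430", "\u0301")

print(latin, cyrillic)
```

The second case is the one under discussion: the sequence is perfectly
legal Unicode, it simply cannot be reduced to one code point.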

So, here we can also check if the compose sequence matches the "valid"
compose sequence (a cyrillic small 'a' with a combining acute is ok)

There is no such thing as a "valid" compose sequence. As Unicode says,

"All combining characters can be applied to any base character and can,
in principle, be used with any script. As with other characters, the
allocation of a combining character to one block or another identifies
only its primary usage; it is not intended to define or limit the range
of characters to which it may be applied.  In the Unicode Standard, all
sequences of character codes are permitted.

This does not create an obligation on implementations to support all
possible combinations equally well. Thus, while application of an
Arabic annotation mark to a Han character or a Devanagari consonant is
permitted, it is unlikely to be supported well in rendering or to make
much sense."

So there are indeed combinations that don't make much sense, but
enumerating those that do looks to me like unnecessary work:

- It may potentially be very big, just see all the possible Vietnamese
  combinations.
- It will most likely never be complete, there will always be a language
  (say, for instance, Tagbanwa) which nobody takes care of.
- Why limit ourselves like this? It has been objected that generic
  support potentially leads to "odd" things like n̈̈̈, which is an n
  with three diaereses on it.  I don't think this is odd: if the user
  pressed the dead_diaeresis key several times, I guess he indeed wanted
  to have three diaereses, and if they don't show up, then the text
  rendering engine is probably broken and may not, for instance, properly
  show ẫ, which is needed for Vietnamese (actually, on my system,
  pango shows both fine).  Actually I think some mathematicians may even
  have a use for n with several diaereses :)
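
To make the two examples in the list concrete, here is a small sketch
using Python's unicodedata module (chosen only for illustration; nothing
here comes from GTK+ or pango). It shows that an n with three combining
diaereses is just a four-code-point sequence that NFC leaves alone, and
that the Vietnamese ẫ is itself a base letter carrying two combining
marks once decomposed:

```python
import unicodedata

COMBINING_DIAERESIS = "\u0308"

# 'n' followed by three combining diaereses. No precomposed form exists,
# so NFC normalisation keeps all four code points.
triple = unicodedata.normalize("NFC", "n" + COMBINING_DIAERESIS * 3)

# U+1EAB (a with circumflex and tilde) decomposes under NFD into
# 'a' + COMBINING CIRCUMFLEX ACCENT + COMBINING TILDE.
decomposed = unicodedata.normalize("NFD", "\u1eab")

print(len(triple), [hex(ord(c)) for c in decomposed])
```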

How would we know which compose sequences are "valid"? We can parse
parts of ftp.unicode.org/Public/UNIDATA/NormalizationTest.txt

It is _not_ a table of "valid" characters, it is only a partial test
to check that the algorithm which transforms character + combining
character into normalized precomposed form works correctly. Actually,
a table that would hold _all_ the valid combinations would be very
big. Just for the Vietnamese language, there would be 10*6 entries.

Instead, it could be solved once and for all by systematically turning
<dead_foo> <bar>, <combining_foo> <bar> and <Multi_key> <foo> <bar> into
"Ubar Ucombining_foo". The only limitation is the font rendering engine,
which already seems to do a pretty good job in all the cases: if I try
to put a Tagbanwa accent on a Latin letter, it just works. If I try to
put a combining Kannada vowel sign on a Kannada character to which it
isn't supposed to apply, it just shows the character and then the
combining vowel sign with a dotted circle.
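
A minimal sketch of that generic rule, written in Python for brevity
rather than as a patch to gtkimcontextsimple.c. The table
DEAD_TO_COMBINING and the function compose() are hypothetical names for
this example, the table holds only a few entries, and appending the marks
in the order the dead keys were pressed is an assumption:

```python
import unicodedata

# Hypothetical, deliberately tiny mapping from dead-key names to the
# corresponding combining characters.
DEAD_TO_COMBINING = {
    "dead_acute": "\u0301",
    "dead_circumflex": "\u0302",
    "dead_tilde": "\u0303",
    "dead_diaeresis": "\u0308",
}

def compose(dead_keys, base):
    """Emit base + combining marks, then let NFC pick a precomposed form if any."""
    marks = "".join(DEAD_TO_COMBINING[key] for key in dead_keys)
    return unicodedata.normalize("NFC", base + marks)

# A precomposed form exists: the result is the single character U+00E9.
e_acute = compose(["dead_acute"], "e")

# No precomposed form: the result is Cyrillic 'а' followed by U+0301.
cyr_a_acute = compose(["dead_acute"], "\u0430")

# Two accents on one vowel, as in Vietnamese: collapses to U+1EAB.
viet = compose(["dead_circumflex", "dead_tilde"], "a")

print(e_acute, cyr_a_acute, viet)
```

With this approach no per-language table of allowed combinations is
needed; whether the result looks right is left to the rendering engine.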

If the implementation can be generic enough that it works ASAP for every
language in the world without more work, then why not do it?

Samuel

