Hello,
Simos wrote:
In bug #341341, Danilo talks about support for compose sequences that
produce more than one Unicode character, as in
COMBINING ACUTE + CYRILLIC LATIN A where no precomposed form exists.
At the moment, the Xorg Compose file does not have such compose
sequences. If we were to implement this in GTK+, I would suggest building up
a new table of the form
dead_acute, A, E, H, I, O, U, ... (assume all these cyrillic)
dead_diaeresis, A, E, H, I, O, U, ... (assume all these cyrillic)
The problem is that this is very tedious for people who already have a
hard time making Linux suit their language (fonts, messages, locales,
...), and the table can potentially be very big. For instance, in
Vietnamese you may need to put two accents on a vowel, so you'd need to
enumerate all such possible combinations.
In check_algorithmic, we currently check if the compose sequence can be
normalised to a single Unicode character.
Which is necessary for proper string uniqueness/comparison etc., yes.
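
To make that check concrete, here is a minimal sketch in Python (not the actual GTK+ check_algorithmic code, which is C). The helper name composes_to_single_char is made up for illustration; the only real API used is unicodedata.normalize from the standard library.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"

def composes_to_single_char(base, *combining):
    """Return True if base + combining marks normalise (NFC) to one code point.

    Hypothetical helper for illustration; it only mirrors the idea of the
    check described above, not the GTK+ implementation.
    """
    sequence = base + "".join(combining)
    return len(unicodedata.normalize("NFC", sequence)) == 1

# LATIN SMALL LETTER A + combining acute has a precomposed form (U+00E1).
latin_ok = composes_to_single_char("a", COMBINING_ACUTE)

# CYRILLIC SMALL LETTER A (U+0430) + combining acute has no precomposed form,
# so a "single character" check rejects it.
cyrillic_ok = composes_to_single_char("\u0430", COMBINING_ACUTE)
```

With a check like this, the Cyrillic case from the bug report is rejected even though the two-code-point result is perfectly legal Unicode.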
So, here we can also check if the compose sequence matches the "valid"
compose sequence (a Cyrillic small 'a' with a combining acute is ok).
There is no such thing as a "valid" compose sequence. As Unicode says,
"All combining characters can be applied to any base character and can,
in principle, be used with any script. As with other characters, the
allocation of a combining character to one block or another identifies
only its primary usage; it is not intended to define or limit the range
of characters to which it may be applied. In the Unicode Standard, all
sequences of character codes are permitted.
This does not create an obligation on implementations to support all
possible combinations equally well. Thus, while application of an
Arabic annotation mark to a Han character or a Devanagari consonant is
permitted, it is unlikely to be supported well in rendering or to make
much sense."
So there are indeed combinations that don't make much sense, but
enumerating those that do looks to me like unnecessary work:
- It may potentially be very big, just see all the possible Vietnamese
combinations.
- It will almost never be complete, there will always be a language
(say, for instance, Tagbanwa) which nobody takes care of.
- Why limit ourselves like this? It has been objected that generic
support potentially leads to "odd" things like n̈̈̈, which is an n
with three diaereses on it. I don't think this is odd: if the user
pressed the dead_diaeresis key several times, I guess he indeed wanted
three diaereses, and if they don't show up, then the text
rendering engine is probably broken and may not, for instance, properly
show ẫ, which is needed for Vietnamese (actually, on my system,
pango shows both fine). Actually, I think some mathematicians may even
have a use for n with several diaereses :)
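
As a rough illustration of the two cases just mentioned, the Python sketch below shows what NFC normalisation does with them. The helper nfc_codepoints is a made-up name; it says nothing about how pango renders the result, only which code points come out.

```python
import unicodedata

DIAERESIS = "\u0308"   # COMBINING DIAERESIS
CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT
TILDE = "\u0303"       # COMBINING TILDE

def nfc_codepoints(text):
    """List the code points of text after NFC normalisation (illustrative helper)."""
    return ["U+%04X" % ord(ch) for ch in unicodedata.normalize("NFC", text)]

# n with three diaereses: no precomposed form exists, so all four
# code points survive normalisation.
triple = nfc_codepoints("n" + DIAERESIS * 3)

# a + circumflex + tilde: collapses to the single precomposed
# Vietnamese letter U+1EAB.
viet = nfc_codepoints("a" + CIRCUMFLEX + TILDE)
```

Both strings are valid; the only difference is whether Unicode happens to provide a precomposed code point for them.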
How would we know which compose sequences are "valid"? We can parse
parts of ftp.unicode.org/Public/UNIDATA/NormalizationTest.txt
It is _not_ a table of "valid" characters, it is only a partial test
to check that the algorithm which transforms character + combining
character into normalized precomposed form works correctly. Actually,
a table that would hold _all_ the valid combinations would be very
big. Just for the Vietnamese language, there would be 10*6 entries.
Instead, it could be solved once and for all by systematically turning
<dead_foo> <bar>, <combining_foo> <bar> and <Multi_key> <foo> <bar> into
"Ubar Ucombining_foo". The only limitation is the font rendering engine,
which already seems to do a pretty good job in all cases: if I try
to put a Tagbanwa accent on a Latin accent, it just works. If I try to
put a combining Kannada vocalic on a Kannada character to which it isn't
supposed to apply, it just shows the character and then the combining
vocalic with a dotted circle.
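
A minimal sketch of that generic rule, in Python rather than GTK+'s C, could look like the following. The table DEAD_KEY_TO_COMBINING and the function compose are hypothetical names, the table covers only a handful of dead keys, and emitting the marks in keypress order is an assumption of this sketch, not something GTK+ or Xorg is known to do.

```python
import unicodedata

# Hypothetical, deliberately tiny mapping from dead key names to
# combining characters. A real table would cover every dead key.
DEAD_KEY_TO_COMBINING = {
    "dead_grave": "\u0300",
    "dead_acute": "\u0301",
    "dead_circumflex": "\u0302",
    "dead_tilde": "\u0303",
    "dead_diaeresis": "\u0308",
}

def compose(keys):
    """Turn [dead_key, ..., base_char] into base + combining marks, then NFC.

    Assumption of this sketch: marks are appended in the order the dead
    keys were pressed. No per-script table of "valid" sequences is consulted.
    """
    *dead_keys, base = keys
    marks = "".join(DEAD_KEY_TO_COMBINING[k] for k in dead_keys)
    return unicodedata.normalize("NFC", base + marks)

latin = compose(["dead_acute", "a"])                     # precomposed form exists
cyrillic = compose(["dead_acute", "\u0430"])             # stays base + mark
viet = compose(["dead_circumflex", "dead_tilde", "a"])   # two accents on one vowel
stacked = compose(["dead_diaeresis"] * 3 + ["n"])        # n with three diaereses
```

The same few lines handle the Latin, Cyrillic and Vietnamese cases alike; normalisation picks the precomposed form where one exists and leaves the sequence alone otherwise.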
If the implementation can be generic enough that it works ASAP for every
language in the world without more work, then why not do it?
Samuel