Dead keys and Unicode combining accents


In bugs 345254 and 341341, it has been requested that dead keys,
combining keys and compose keys produce Unicode combining marks, so as
to support accented letters in some languages.

The usual approach has always been to use a compose file containing a
long list of entries such as

<dead_acute>             <Cyrillic_i>  : "и́"
<combining_acute>        <Cyrillic_i>  : "и́"
<Multi_key> <acute>      <Cyrillic_i>  : "и́"
<Multi_key> <apostrophe> <Cyrillic_i>  : "и́"
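For reference, the "и́" on the right-hand side of each entry is really a
two-codepoint sequence, and Unicode has no precomposed character for it.
A quick check (a Python sketch using the standard `unicodedata` module)
confirms this:

```python
import unicodedata

# The right-hand side of each Compose entry above: U+0438 CYRILLIC
# SMALL LETTER I followed by U+0301 COMBINING ACUTE ACCENT.
s = "\u0438\u0301"

# Both codepoints are what we expect.
assert [unicodedata.name(c) for c in s] == [
    "CYRILLIC SMALL LETTER I",
    "COMBINING ACUTE ACCENT",
]

# No precomposed form exists, so even NFC leaves it decomposed.
assert unicodedata.normalize("NFC", s) == s
```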

However, this is very tedious for people who already have a hard time
making Linux suit their language (fonts, messages, locales, ...), and
such a file can become very large (for Vietnamese, for instance, there
are roughly 10×6 combinations of vowels and accents, and each should be
available through all of the dead-key, combining and Multi_key
approaches).

Instead, it could be solved once and for all by automatically turning
<dead_foo> <bar>, <combining_foo> <bar> and <Multi_key> <foo> <bar> into
"Ubar Ucombining_foo".

It has been objected that this could lead to "odd" things like n̈̈̈, an
n with three diaereses on it.  I don't think this is odd: if the user
pressed the dead_diaeresis key several times, I guess they indeed wanted
three diaereses, and if those don't show up, then the text rendering
engine is probably broken and may, for instance, not properly show ẫ,
which is needed for Vietnamese (on my system, Pango actually shows both
fine).  I think some mathematicians may even have a use for an n with
several diaereses :)

Now, of course, there are the normalization algorithms and precomposed
forms, which should be applied so that canonically equivalent strings
compare equal.  But as bug 341341 highlights, for some languages
(actually, for most languages) Unicode doesn't have all the needed
precomposed forms, simply because that would make the Unicode code space
extremely big, so decomposed forms _need_ to be supported, and I see no
reason to limit their support to some explicit list when they could be
handled in a generic way.  As Unicode says,

"All combining characters can be applied to any base character and can,
in principle, be used with any script. As with other characters, the
allocation of a combining character to one block or another identifies
only its primary usage; it is not intended to define or limit the range
of characters to which it may be applied.  In the Unicode Standard, all
sequences of character codes are permitted.

This does not create an obligation on implementations to support all possible
combinations equally well. Thus, while application of an Arabic annotation mark
to a Han character or a Devanagari consonant is permitted, it is unlikely to be
supported well in rendering or to make much sense."

But if the implementation can be generic enough that it works, then why
not do it?
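To make the normalization point concrete, here is a small check (again a
Python sketch with `unicodedata`): applying NFC to the generically
emitted "base + combining marks" sequence yields the precomposed
character when Unicode has one, and simply leaves the sequence
decomposed when it does not, so string comparison behaves consistently
either way:

```python
import unicodedata

# a + combining circumflex + combining tilde composes to the single
# precomposed character ẫ (U+1EAB), needed for Vietnamese.
assert unicodedata.normalize("NFC", "a\u0302\u0303") == "\u1eab"

# и + combining acute has no precomposed form, so NFC leaves the
# two-codepoint sequence as-is.
assert unicodedata.normalize("NFC", "\u0438\u0301") == "\u0438\u0301"
```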
