Re: Industry Thai Cell-Clustering Rules

From: Pablo Saratxaga <pablo mandrakesoft com>
To: gtk-i18n-list gnome org
Subject: Re: Industry Thai Cell-Clustering Rules
Date: Fri, 3 Nov 2000 21:12:24 +0100

Kaixo!

On Fri, Nov 03, 2000 at 10:29:55AM -0800, Chookij Vanatham wrote:
 
> Hi Pablo,
> 
> I'm glad that you join our discussion too.
> Here is my answer.

Thanks to the message from Theppitak Karoonboonyanan I have been able to
understand most of the problem.

> ] > The last two code sets are mapped to their own private area of Unicode and
> ] > cannot be used together.
> ] 
> ] Can't those be detected somehow ? (it would be interesting to have the
> ] list of combinations and the codepoints assigned to the precombined glyphs)
> 
> Not quite sure about the question, give me more detail.

If a given codepoint is only used in the Mac extension, for example, we could
check for the availability of a glyph at that position to autodetect the
Mac extension.

Anyway, having a table showing those extensions would help in understanding it.
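
To make the autodetection idea concrete, here is a rough Python sketch. It
assumes we can already obtain the set of codepoints a font has glyphs for
(how to get that from a real font is left out), and the "signature"
codepoints in the table are placeholders I made up, since the real
extension tables have not been posted in this thread.

```python
# Placeholder signature code points. These are NOT the real values used by
# any vendor extension; they only illustrate the lookup.
SIGNATURES = {
    "mac-ext": {0xE000, 0xE001},
    "other-ext": {0xE100, 0xE101},
}


def detect_extensions(covered, signatures=SIGNATURES):
    """Return the names of the extensions whose signature code points
    are all present in `covered` (a set of ints the font has glyphs for)."""
    return sorted(
        name for name, points in signatures.items() if points <= covered
    )


# A hypothetical font covering one Thai letter plus the "mac-ext" signature.
font_coverage = {0x0E01, 0xE000, 0xE001}
print(detect_extensions(font_coverage))
```

This only works if each extension uses at least one codepoint that the
others leave empty, which is exactly what a table of the extensions would
tell us.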

> ] > 	[Consonant + Tonemark] [Vowel] ----> Two cell clusters
> ] > ******
> ] 
> ] But that is not font-specific, is it ?
> 
> This is the point.

It was my impression, and the message from Theppitak Karoonboonyanan
confirmed it, that it is not font specific at all.
The font differences only affect the aesthetic aspect of the rendering
(eg: nicely placed tonemarks or "floating" ones) but not the cell cluster order.
(unless some fonts have real precombined glyphs and use glyph substitution;
but Pango doesn't have OpenType support yet; and when it does, the
font substitutions should override anything else, so it isn't really
a problem, if I understand correctly)

> Here is the piece of Thai pango engine codes which are for determining,
> the Thai cell cluster.
... 
 
> This piece of code determines Cell-Clustering for Thai and, of course,
> it doesn't use Wtt2.0 Cell-clustering logic.

It should be modified to implement Wtt2.0.
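
As an illustration only, here is a small Python sketch of an order-sensitive
clustering rule of the kind discussed in this thread: a consonant may take
an above/below vowel and then a tonemark, and a vowel arriving after a
tonemark starts a new cell. The character classes are a simplification I
chose from the Unicode Thai block, and the function names are made up; this
is not the full Wtt2.0 table and not the existing Pango code.

```python
CONS, VOWEL, TONE, OTHER = "cons", "vowel", "tone", "other"


def classify(ch):
    """Very simplified Thai character classes (assumption, not Wtt2.0)."""
    cp = ord(ch)
    if 0x0E01 <= cp <= 0x0E2E:
        return CONS
    if cp == 0x0E31 or 0x0E34 <= cp <= 0x0E3A:
        return VOWEL  # marks drawn above or below the consonant
    if 0x0E48 <= cp <= 0x0E4B:
        return TONE
    return OTHER


def strict_cells(text):
    """Split text into cells, accepting marks only in the right order."""
    cells = []
    last = OTHER
    for ch in text:
        kind = classify(ch)
        attaches = (kind == VOWEL and last == CONS) or (
            kind == TONE and last in (CONS, VOWEL)
        )
        if attaches:
            cells[-1] += ch
            last = kind
        else:
            # Anything else opens a new cell; only a consonant can be a base.
            cells.append(ch)
            last = kind if kind == CONS else OTHER
    return cells


right = "\u0e01\u0e34\u0e49"  # consonant, vowel, tonemark
wrong = "\u0e01\u0e49\u0e34"  # consonant, tonemark, vowel
print(len(strict_cells(right)))  # one cell
print(len(strict_cells(wrong)))  # two cells: the vowel is left on its own
```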

> That's why, we put Cell-clustering to XLFD name,

as it isn't font dependent, it doesn't make sense to put that into the font
name.

> It's not really clear cut to say that Cell-clustering is not specific to
> font. Unfortunately, in the industry, there are more than 1 cell-clustering
> rule.

but only Wtt2.0 is the official standard, so we should support only that.
The fact that it makes it possible to distinguish right and wrong order in
typing is an important feature that well outweighs the little extra
complexity in the renderer.
(note: latin typing with dead keys is a bit similar in that aspect:
dead_circumflex + e will give ê, but e + dead_circumflex will give e^;
it is not exactly the same, as ê is one char while e^ is two chars; for
Thai the same chars are involved; but the idea of visually seeing a wrong
sequence is important, as only the use of the right sequences allows computer
treatment of text (search, sorting, spell checking, etc))

> That's right. We should let users to choose which one they want.

I don't think that anymore; I think that only the official one should be
implemented; the existence of the others seems to be only the result of
the inability or unwillingness to implement Wtt2.0.
If however some other scheme has real merit and users want it, it should
be added; but there doesn't seem to be any evidence that this is the case.

[problems with legacy fonts and clustering rules]
> ] >    We won't be able to have Thai display correctly after we do text
> ] >     manipulation, like, insert, delete, copy-paste, selection, scrolling,
> ] 
> ] Because the copied string into the buffer uses the non tis-620 codepoint of
> ] the precomposed glyphs or because of a cursor positionning problem?
> 
> Let's say if those software don't concern about cell-clustering but use
> Thai font whose tonemark/vowel are zero-width space. There are a lot of
> incorrect behavior that are not able to be accepted for sure.

Yes, but that is independent of the font extensions and even of the clustering
rules (except that Wtt2.0 makes wrong sequences easier to deal with)

> Let's say A - consonant, B - vowel, C - tonemark.
> 
> 			C
> 			B
> 			A
> 			
> B and C are combined and displayed on top of A (because zero-width).
> Think about using the same logic to move I-Beam, you need to type arrow-key
> left or right 3 times to move I-beam to the left or the right.
> Doesn't it look weird ? Why do you need to type arrow-key 3 times ?

It doesn't look weird to me.

> The problem will happen among insert/delete/copy-paste/selection/scrolling
> if the software don't consider Cell-clustering.

Yes.
That is a problem indeed.
I think the cell clusters must really be considered as a block, and insertion
or deletion inside of them should not be allowed. That is, in the byte sequence:

 x x x x x A B C y y y y y

insertion should be possible only before A or after C.
selection should include A B C or none of them, but never only one of them.
deleting with backspace or del should start by deleting first C then B
then A.
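
The three rules above can be sketched in Python as follows. This assumes
the text has already been split into cells (a list of strings, one per
cluster); the function names are mine, and a real implementation would
recompute the clusters after each edit.

```python
def caret_stops(cells):
    """Offsets where the caret may rest: only at cell boundaries."""
    stops = [0]
    for cell in cells:
        stops.append(stops[-1] + len(cell))
    return stops


def snap_selection(cells, start, end):
    """Grow a selection so it never cuts through a cell."""
    stops = caret_stops(cells)
    new_start = max(s for s in stops if s <= start)
    new_end = min(s for s in stops if s >= end)
    return new_start, new_end


def backspace(cells, caret):
    """Remove the last char of the cell that ends at `caret`.
    Returns the new cells and the new caret offset."""
    stops = caret_stops(cells)
    if caret == 0 or caret not in stops:
        return cells, caret
    i = stops.index(caret) - 1  # index of the cell ending at the caret
    shortened = cells[i][:-1]
    new_cells = cells[:i] + ([shortened] if shortened else []) + cells[i + 1:]
    return new_cells, caret - 1


cells = ["x", "ABC", "y"]
print(caret_stops(cells))           # no stop between A, B and C
print(snap_selection(cells, 2, 3))  # grows to cover the whole A B C cell
print(backspace(cells, 4))          # removes C first
```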

Is that a logical approach for a Thai person ?


Note that this problem is not limited to Thai, but is also the same for
all indic scripts, Lao, Khmer, etc. and also Arabic and Hebrew.
(maybe also Korean hangeul? I think the way it works in Korean is that once
the conjunct is done, it has its own value and is selected/deleted simply
as a single char (same thing with latin/cyrillic/greek scripts;
eg: ê (e circumflex) is a single char; you cannot select/delete only the 'e'
or only the '^')).

So the same approach is needed.

Note that the only difference here between Wtt2.0 and other cell clusterings
would be for wrong sequences: with other cell clusterings, as they make a
single cluster, you treat them as normal clusters; with Wtt2.0 if you have a
wrong sequence A C B (instead of right A B C) it will display as (I'll use '-'
to link together cluster elements):   A-C B   so you can put the cursor
between C and B, and insert things, or delete B (with Delete) or delete C
(with BackSpace). So to correct ACB into ABC you must do:

you have: A-C B
place cursor (x): A-C x B
press BackSpace: A x B
Type B: A-B x B
then C: A-B-C x B
delete (with Delete) the extra B: A-B-C

while when using a non-Wtt2.0, first you won't see that there is an error :-(
then to correct it you must do:

you have: A-C-B
place cursor: A-C-B x  (or x A-C-B)
delete B (with BackSpace): A-C x (or with Delete: x A-C)
delete C (with BackSpace): A x (or with Delete: x A)
type B: A-B x  (if you were at left of A, first press "right arrow") 
type C: A-B-C x
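
The difference in what the user sees can be reproduced with a toy Python
model, using the same letters as above (A consonant, B vowel, C tonemark)
and the same '-' notation. The "strict" rule stands for an order-sensitive
clustering like the one I described, the other one for a clustering that
attaches any mark to the preceding consonant; both are simplified
assumptions of mine, not the actual rule tables.

```python
KINDS = {"A": "cons", "B": "vowel", "C": "tone"}


def cells(text, strict):
    """Split text into cells under the strict or the order-blind rule."""
    out = []
    last = None
    for ch in text:
        kind = KINDS.get(ch)
        if strict:
            attaches = (kind == "vowel" and last == "cons") or (
                kind == "tone" and last in ("cons", "vowel")
            )
        else:
            attaches = kind in ("vowel", "tone") and last is not None
        if attaches:
            out[-1] += ch
            last = kind
        else:
            out.append(ch)
            last = kind if kind == "cons" else None
    return out


def show(text, strict):
    """Render cells with '-' inside a cell and a space between cells."""
    return " ".join("-".join(cell) for cell in cells(text, strict))


print(show("ACB", strict=True))   # A-C B   (the error is visible)
print(show("ACB", strict=False))  # A-C-B   (the error is hidden)
print(show("ABC", strict=True))   # A-B-C
```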


Note that indic scripts (devanagari, etc) (and Khmer I think) are inherently
like Wtt2.0, as only correct order can produce the conjuncts (what you type
is letters, not shapes).
Lao, on the other hand, is like Thai it seems (but I would need confirmation),
and would then need a similar clustering (that is, the clustering engine
should not be part of the Thai module, but of Pango; then the Thai/Lao modules
call it and pass it a set of data with which the engine will be able to
calculate the clusters).
For Arabic and Hebrew, I think it would be better to treat them the same
way as latin script: once the conjunct is created, consider it as a single
char (that is made easier by the fact that the composed conjuncts have
unicode values).
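
To show what I mean by a shared engine fed with per-script data, here is a
rough Python sketch (the real thing would of course be C inside Pango).
The engine knows nothing about Thai or Lao; each module passes it a table
of codepoint ranges and the class pairs that are allowed to join. The
ranges are simplified picks from the Unicode Thai and Lao blocks and the
names are hypothetical. Unlike a complete rule table, this toy version
would also join a tonemark to a stray vowel.

```python
def make_clusterer(classes, attach):
    """classes: list of (first_cp, last_cp, class_name) ranges.
    attach: set of (previous_class, next_class) pairs that stay in one cell."""

    def classify(ch):
        cp = ord(ch)
        for first, last, name in classes:
            if first <= cp <= last:
                return name
        return None

    def clusterer(text):
        cells, prev = [], None
        for ch in text:
            kind = classify(ch)
            if cells and (prev, kind) in attach:
                cells[-1] += ch
            else:
                cells.append(ch)
            prev = kind
        return cells

    return clusterer


# The same ordering rule for both scripts, only the data differs.
ORDER = {("cons", "vowel"), ("cons", "tone"), ("vowel", "tone")}

THAI = [(0x0E01, 0x0E2E, "cons"), (0x0E31, 0x0E31, "vowel"),
        (0x0E34, 0x0E3A, "vowel"), (0x0E48, 0x0E4B, "tone")]
LAO = [(0x0E81, 0x0EAE, "cons"), (0x0EB1, 0x0EB1, "vowel"),
       (0x0EB4, 0x0EB9, "vowel"), (0x0EBB, 0x0EBB, "vowel"),
       (0x0EC8, 0x0ECB, "tone")]

thai_cells = make_clusterer(THAI, ORDER)
lao_cells = make_clusterer(LAO, ORDER)

print(len(thai_cells("\u0e01\u0e49\u0e34")))  # wrong order: two cells
print(len(lao_cells("\u0e81\u0eb4\u0ec9")))   # right order: one cell
```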

> Hope this would help.
> 
> 
> Chookij V.

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/		PGP Key available, key ID: 0x8F0E4975