Re: Industry Thai Cell-Clustering Rules


I seem to have missed a very important discussion last night. :-)

On Thu, 2 Nov 2000, Chookij Vanatham wrote:

> K.Theppitak and other Thai folks,
> Let's try to finalize various Thai Cell-Clustering Rules in the industry
> so that we can keep them as standard and having them defined in the
> encoding field of the XLFD.
> Let's start with the following info from K.Theppitak.
> According to our experience, there are three different practices of Thai
> fonts for rendering :
> 1. Plain tis620 : combining characters are placed at the safe positions to
>    prevent collapsion. There are two practices of this kind :
>    - negative-offset-zero-width diacritics (this makes the fonts apply to
>      many applications, such as Netscape, which support Western fonts
>      without knowing they are rendering Thai fonts)
>    - real monospace fonts (used in mule/emacs; this requires the
>      applications to combine characters into cells)
> 2. MacThai extension : an extended tis620 code set, by using codes in the
>    range 0x80-0x9f and in some free slots to keep the prepositioned
>    combining characters. This needs a shaping algorithm to produce 
>    elegant rendering.
> 3. WindowsThai extension : similar to MacThai extension, but used in
>    Windows Thai Editions.
> The last two code sets are mapped to their own private area of Unicode and
> cannot be used together.
> So, we are now discussing about a convention on the encoding field on XLFD
> to distinguish the three code sets :-
>   -tis620-0   for plain tis620
>   -tis620-1   for MacThai extension
>   -tis620-2   for WindowsThai extension
> Note that the years in the registry field are omitted, because tis620.2529
> and tis620.2533 do not differ in content. Both can be referred to as
> tis620 without confusion.

Note that this convention was proposed by a group of Thai developers.
It will be put forward in various open source communities so that it is
adopted consistently, for the benefit of Thai users.
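
As a rough illustration of the convention, here is a minimal Python sketch
that reads the last two XLFD fields (CHARSET_REGISTRY and CHARSET_ENCODING)
and maps them to one of the three code sets. The function name, the labels
and the sample font names are hypothetical and only serve the example; it
assumes a fully specified 14-field XLFD name without wildcards.

```python
# Hypothetical labels for the three code sets discussed above.
THAI_CODE_SETS = {
    "0": "plain TIS-620",
    "1": "MacThai extension",
    "2": "WindowsThai extension",
}


def thai_code_set(xlfd):
    """Return the code-set label for a tis620 XLFD name, or None."""
    parts = xlfd.split("-")
    # An XLFD name starts with "-" and has 14 fields, so splitting on "-"
    # yields 15 parts (the first one is empty).
    if len(parts) != 15:
        return None
    registry, encoding = parts[-2].lower(), parts[-1]
    # Accept year-qualified registries (tis620.2529, tis620.2533) as well,
    # since the two editions do not differ in content.
    if registry.split(".")[0] != "tis620":
        return None
    return THAI_CODE_SETS.get(encoding)


print(thai_code_set("-misc-fixed-medium-r-normal--14-130-75-75-c-70-tis620-0"))
```

A renderer could use the returned label to decide which glyph table to load,
and fall back to plain TIS 620 behaviour when the label is unknown.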

> Now, we need to talk about Cell-clustering rules which are going to be used
> with fonts.
> As far as I know, there is only 1 cell-clustering rule defined from Thai
> government (by NECTEC). This one is called Wtt2.0 and the detail is attached.
> We should add the word "wtt2.0" to any names if they are using Wtt2.0 cell
> clustering rule.
> Ex:
>   -tis620-0.wtt2.0 for plain tis620 with Wtt2.0 cell clustering rule
>   -tis620-1.wtt2.0 for MacThai extension with Wtt2.0 cell clustering rule
>   -tis620-2.wtt2.0 for WindowsThai extension with Wtt2.0 cell clustering rule

In my opinion, the need for the ".wtt2.0" extension is quite doubtful,
because WTT 2.0 is not dependent on font encoding variations.

Let me say:
- tis620-1 and tis620-2 are proper supersets of tis620-0.
- So, any rendering engine that works with plain TIS 620 (i.e. tis620-0)
  can be applied to all of the above categories of the tis620 registry,
  and WTT 2.0 provides one such engine.
- So, choosing the rendering engine is a matter of preference rather
  than a matter of font encoding.

Now regarding the Mac and Windows variations:
- In naive implementations, Thai tone marks are always placed at the
  highest level, above the upper vowels, to prevent collapsing.
- This safely renders readable Thai text. However, it looks
  typographically bad in cells that are composed of only a base line
  character and a tone mark, without an upper vowel.
- Elegant renderers therefore lower the tone mark to the position of the
  upper vowel.
- In the absence of an intelligent rendering mechanism, additional code
  points were designed to represent the lowered tone marks.
- Another case is the problem of base characters with an upshooting tail.
  The upper vowels and tone marks have to be shifted to the left to
  provide space for the tail.
- The result is 2 sets of upper vowel glyphs and 4 sets of tone mark
  glyphs.
- There are some other extensions, such as the 2 sets of lower vowels to
  provide space for base characters with a downshooting tail, and an
  additional set of glyphs for the baseless versions of consonants that
  have a base.
- These need code points beyond the original TIS 620, and Mac and
  Windows have each designed their own extensions.
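
To make the "4 sets of tone mark" point concrete, here is a small Python
sketch that picks a tone mark variant for one cell: high or low depending on
whether an upper vowel is present, and normal or left-shifted depending on
whether the base has an upshooting tail. The character lists and the variant
names are assumptions for illustration, not taken from the Mac or Windows
tables.

```python
# Assumed character classes (Unicode code points, for readability).
ASCENDER_BASES = set("\u0e1b\u0e1d\u0e1f")            # PO PLA, FO FA, FO FAN
UPPER_VOWELS = set("\u0e31\u0e34\u0e35\u0e36\u0e37")  # MAI HAN-AKAT, SARA I..UEE
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")          # MAI EK..MAI CHATTAWA


def tone_mark_variant(cell):
    """Return (height, shift) for the tone mark in a cell, or None.

    The cell is a string whose first character is the base character.
    """
    if not any(ch in TONE_MARKS for ch in cell):
        return None
    # Without an upper vowel, the tone mark is lowered to the vowel position.
    height = "high" if any(ch in UPPER_VOWELS for ch in cell) else "low"
    # A base with an upshooting tail pushes the marks to the left.
    shift = "left" if cell[0] in ASCENDER_BASES else "normal"
    return (height, shift)


print(tone_mark_variant("\u0e1b\u0e48"))  # PO PLA + MAI EK
```

The two-by-two result (high/low, normal/left) corresponds to the 4 tone mark
sets; the upper vowels only need the normal/left distinction, hence 2 sets.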

- A rendering engine based on tis620-0, WTT 2.0 included, can be applied
  to all tis620 fonts, but the result is inelegant.
- The Mac rendering engine can be applied to tis620-1 fonts only.
- The Windows rendering engine can be applied to tis620-2 fonts only.

However, WTT 2.0 has provided a perfect framework for Thai text I/O.
Illegal keyboard input sequences can be filtered out at 3 levels of
strictness. And illegal character sequences in text buffers are guaranteed
to be noticeable to users when rendered.

Therefore, it would be ideal to adopt the WTT 2.0 rendering engine, with an
extension to pick the Mac or Windows alternate glyphs, according to the font
encoding, when available.

In fact, Microsoft and Macintosh were also in the Thai API Consortium
(TAPIC), which defined WTT 2.0, and their Thai support is based on the
specification. However, their support is not complete and has been
customized in their own ways.

> Let me point out the important piece of Wtt2.0 Cell-Clustering rule in order
> to compare other Cell-Clustering rules more easily (but, please refer to
> the detail when doing the implementation).
> ****
> If the cell-cluster is composed of "consonant", "vowel" and "tonemark",
> vowel character will always follow consonant and tonemark character
> will always follow vowel as shown below.
> 	Consonant + Vowel + Tonemark  -----> One cell cluster
> If tonemark comes before vowel, the vowel character will be considered as
> another cell-cluster as shown below.
> 	[Consonant + Tonemark] [Vowel] ----> Two cell clusters
> ******
> Other Thai Cell-Cluster rules are done by various companies. I don't know
> exactly how many they are going to be. Let's focus on those popular ones
> whether they are needed to be defined as the extra names. 
> (1) Thai Microsoft Window
> (2) Thai Macintosh Window
> I think both of them will follow the simple rule as below.
> 	List of cell-cluster
> 	- consonant
> 	- consonant + vowel
> 	- consonant + tonemark
> 	- consonant + vowel + tonemark <---- ****
> 	- consonant + tonemark + vowel <---- ****
> As you can see the one with the mark **** are considered as one cell clustering
> even the sequence of vowel and tonemark are different. This is the different
> point when comparing to Wtt2.0 cell clustering.
> In my opinion, then, we might have these 2 types of cell-clustering rules
> and one has the name "wtt2.0", the other I'm not sure if we are going to
> name it or not.

"incomplete wtt2.0" maybe. :-)

We can prepare a universal rendering engine that realizes the complete WTT
2.0 and embraces Mac and Windows extensions as well. The convention on
tis620 encodings could determine which code point to use. In fact, it's
just a matter of table lookup.
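
To illustrate the difference between the two clustering behaviours quoted
above, here is a simplified Python sketch. With strict=True, a vowel that
arrives after a tone mark starts a new cell, as in the WTT 2.0 summary; with
strict=False, either order is merged into one cell. It covers only
consonants, above/below vowels and tone marks, and the names are
hypothetical; the real WTT 2.0 tables are much more detailed.

```python
VOWELS = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39")  # above/below vowels
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")                # MAI EK..MAI CHATTAWA


def is_consonant(ch):
    """True for KO KAI (U+0E01) through HO NOKHUK (U+0E2E)."""
    return "\u0e01" <= ch <= "\u0e2e"


def split_cells(text, strict=True):
    """Split text into display cells under the simplified rules."""
    cells = []
    has_base = seen_vowel = seen_tone = False
    for ch in text:
        is_vowel = ch in VOWELS
        is_tone = ch in TONE_MARKS
        if is_vowel:
            # Strict mode refuses a vowel once a tone mark has been seen.
            attach = has_base and not seen_vowel and not (strict and seen_tone)
        elif is_tone:
            attach = has_base and not seen_tone
        else:
            attach = False
        if attach:
            cells[-1] += ch
            seen_vowel = seen_vowel or is_vowel
            seen_tone = seen_tone or is_tone
        else:
            # Start a new cell; only a consonant can carry marks.
            cells.append(ch)
            has_base = is_consonant(ch)
            seen_vowel = seen_tone = False
    return cells


wrong_order = "\u0e01\u0e48\u0e34"  # KO KAI + MAI EK + SARA I
print(len(split_cells(wrong_order, strict=True)))   # 2
print(len(split_cells(wrong_order, strict=False)))  # 1
```

In strict mode the misplaced vowel ends up in its own cell, which is what
makes the illegal sequence visible to the user.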

> From this point if we can finalize these, then, how individual cell cluster
> is going to be shaped, that should depend on whichever fonts (plain, Mac,
> Microsoft) are using.
> Unfortunately, neither of them have said clearly about SaraAm case.
> From both Thai Mac/Microsoft windows, the following is the clustering case
> for SaraAm.
> 	Consonant + SaraAm            ----> 1 cell clustering
> 	Consonant + Tonemark + SaraAm ----> 1 cell clustering
> Again, from my opinion, then, the following list should have done cell
> clustering as shown above for SaraAm case.
> 	-tis620-1
> 	-tis620-2
> 	-tis620-2.wtt2.0

To my understanding, this is due to the extensions for typographical
elegance. SARA AM (U+0E33) occupies a single cell according to WTT 2.0.
But in the Windows and Mac extensions, it has to be broken into NIKHAHIT
(U+0E4D) and SARA AA (U+0E32) in rendering. The NIKHAHIT then takes the
previous cell for its appearance, so the two cells have to be merged.
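
The SARA AM split described above can be sketched in Python as follows. The
function name is hypothetical, and placing the NIKHAHIT before a preceding
tone mark (so that it sits on the consonant, under the tone mark) is an
assumption about how a renderer would order the pieces, not something WTT 2.0
or the vendor documents are quoted as saying here.

```python
SARA_AM = "\u0e33"
NIKHAHIT = "\u0e4d"
SARA_AA = "\u0e32"
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")  # MAI EK..MAI CHATTAWA


def decompose_sara_am(text):
    """Replace each SARA AM with NIKHAHIT + SARA AA for rendering."""
    out = []
    for ch in text:
        if ch != SARA_AM:
            out.append(ch)
            continue
        if out and out[-1] in TONE_MARKS:
            # Assumed ordering: the NIKHAHIT goes under the tone mark.
            tone = out.pop()
            out.extend([NIKHAHIT, tone, SARA_AA])
        else:
            out.extend([NIKHAHIT, SARA_AA])
    return "".join(out)


# NO NU + MAI THO + SARA AM becomes NO NU + NIKHAHIT + MAI THO + SARA AA.
rendered = decompose_sara_am("\u0e19\u0e49\u0e33")
print([hex(ord(ch)) for ch in rendered])
```

After the split, the NIKHAHIT belongs visually to the consonant's cell, which
is why Consonant + SARA AM has to be treated as one merged cluster.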

> K.Theppitak and other Thai folks, please let me know about any opinion and see
> if we can come-up with the agreement.

Hope it's useful. :-)

> To Owen,
> Additionally, here is my answer to Owen's questions.
>  - How many different rules are in use
>    I would say, 2 cell-clustering rules. One is called Wtt2.0. The other,
>    as explained above.

I would say: there should have been ONE cell-clustering rule, but WTT 2.0 
might be too complicated for them.

And there are 3 code tables (not rules) for glyph variations.

>  - Which ones we need to support
>    Just my opinion, how about these.
>    	-tis620-0
>    	-tis620-1
>    	-tis620-2
>    	-tis620-2.wtt2.0  <--- I would help on this.


  -tis620-0  -+
  -tis620-1   +-> extended WTT 2.0 (single rule, different tables)
  -tis620-2  -+

>  - How bad the problem is with legacy fonts without identified
>    clustering rules.
>    We won't be able to have Thai display correctly after we do text manipulation,
>    like, insert, delete, copy-paste, selection, scrolling, .....
>    To me, I won't trust to use those software because I'm not too sure if,
>    whatever I edit Thai text files, the result will be correct as we think
>    and as it shows.

The WTT 2.0 clustering rule is a graceful way of managing illegal cases,
which are very likely to occur in the absence of keyboard sequence
checking by XIM.

Moreover, the Mac and Windows extensions are necessary for quality
rendering of Thai.

Theppitak Karoonboonyanan
Software and Language Engineering Laboratory, NECTEC  mailto:theppitak nectec or th
