Re: Industry Thai Cell-Clustering Rules



Hi K.Theppitak,

] 
] 
] > Now, we need to talk about Cell-clustering rules which are going to be used
] > with fonts.
] > 
] > As far as I know, there is only 1 cell-clustering rule defined from Thai
] > government (by NECTEC). This one is called Wtt2.0 and the detail is
] > attached.
] > We should add the word "wtt2.0" to any names if they are using Wtt2.0 cell
] > clustering rule.
] > 
] > Ex:
] > 
] >   -tis620-0.wtt2.0 for plain tis620 with Wtt2.0 cell clustering rule
] >   -tis620-1.wtt2.0 for MacThai extension with Wtt2.0 cell clustering rule
] >   -tis620-2.wtt2.0 for WindowsThai extension with Wtt2.0 cell clustering
] > rule
] 
] In my opinion, the need for the ".wtt2.0" extension is quite doubtful,
] because WTT 2.0 is not dependent on font encoding variations.

This is the point. WTT 2.0 doesn't depend on the font encoding but, for sure,
it does define cell-clustering. Whether cell-clustering is specific to the
font or not, again, I don't know what to say.

Let me show you some code, then; you might help me answer whether
cell-clustering is font-specific or not.

Here is the piece of the Thai Pango engine code that determines
the Thai cell cluster.

    while (p < text + length)
      {
        ...

        /* Classify the character; anything outside U+0E30..U+0E4F is a base. */
        if (wc >= 0xe30 && wc < 0xe50)
          group = groups[wc - 0xe30];
        else
          group = 0;

        switch (group)
          {
          case 0:
            /* A new base character: flush the pending cluster first. */
            if (base)
              {
                add_cluster (font_info, glyphs, cluster_start, base,
                             group1, group2);
                group1 = 0;
                group2 = 0;
              }
            cluster_start = p - text;
            base = wc;
            break;
          case 1:
            group1 = wc;
            break;
          case 2:
            group2 = wc;
            break;
          }

        p = g_utf8_next_char (p);

        ....

      }


This piece of code determines cell-clustering for Thai and, of course,
it doesn't use the Wtt2.0 cell-clustering logic.
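
To make the behaviour of the loop above easier to see, here is a minimal,
self-contained sketch of the same idea. It is not the engine code: it works
on an array of code points instead of UTF-8, the names thai_group and
simple_clusters are made up for illustration, and the classification
(group 1 for upper/lower vowels and signs, group 2 for tone marks) is my
assumption, since the real groups[] table is not shown here. Unlike the
excerpt, a combining mark at the very start of the text opens its own cell.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed classification, NOT copied from the Pango groups[] table:
   0 = starts a new cell, 1 = upper/lower vowel or sign, 2 = tone mark. */
static int thai_group(uint32_t wc)
{
    if (wc == 0x0E31) return 1;                 /* MAI HAN-AKAT */
    if (wc >= 0x0E34 && wc <= 0x0E3A) return 1; /* SARA I .. PHINTHU */
    if (wc == 0x0E47) return 1;                 /* MAITAIKHU */
    if (wc >= 0x0E48 && wc <= 0x0E4B) return 2; /* tone marks */
    if (wc >= 0x0E4C && wc <= 0x0E4E) return 1; /* THANTHAKHAT .. YAMAKKAN */
    return 0;
}

/* Hypothetical helper: stores the start offset of each cell in starts[]
   (at most max entries) and returns the total number of cells found. */
size_t simple_clusters(const uint32_t *text, size_t length,
                       size_t *starts, size_t max)
{
    size_t n = 0;
    for (size_t i = 0; i < length; i++) {
        /* A group-0 character always opens a new cell; a mark opens one
           only when there is no cell to attach it to yet. */
        if (thai_group(text[i]) == 0 || n == 0) {
            if (n < max)
                starts[n] = i;
            n++;
        }
    }
    return n;
}
```

With this rule every mark simply attaches to the preceding base, whatever
the sequence is. That is the part a Wtt2.0 rule would handle differently.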

Now, if users want Thai displayed with Wtt2.0 cell-clustering,
what will you do?

That's why we put the cell-clustering rule into the XLFD name, so that the
engine can determine which cell-clustering rule should be used. Then users
will be able to choose what they want.

It's not really clear-cut to say that cell-clustering is not specific to
the font. Unfortunately, in the industry, there is more than one
cell-clustering rule.
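
As a rough sketch of what I mean, the engine could look at the end of the
XLFD name and pick the rule from it. The enum, the function name
cluster_rule_from_xlfd and the plain suffix check are all hypothetical;
only the ".wtt2.0" suffix itself comes from the proposal.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical rule identifiers, for illustration only. */
typedef enum {
    CLUSTER_RULE_DEFAULT, /* whatever the engine does today */
    CLUSTER_RULE_WTT20    /* Wtt2.0 cell-clustering */
} ClusterRule;

/* Returns CLUSTER_RULE_WTT20 when the XLFD name ends with ".wtt2.0",
   as in "...-tis620-2.wtt2.0"; otherwise the default rule. */
ClusterRule cluster_rule_from_xlfd(const char *xlfd)
{
    static const char suffix[] = ".wtt2.0";
    size_t len = strlen(xlfd);
    size_t slen = sizeof suffix - 1;

    if (len >= slen && strcmp(xlfd + len - slen, suffix) == 0)
        return CLUSTER_RULE_WTT20;
    return CLUSTER_RULE_DEFAULT;
}
```

A font named with tis620-0.wtt2.0 would then get Wtt2.0 clustering, and a
plain tis620-0 font would keep the current behaviour.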


] 
] Let me say:
] - tis620-1 and tis620-2 are proper supersets of tis620-0.
] - So, any rendering engine that works with plain TIS 620 (i.e. tis620-0)
]   can be applied to all above categories of tis620 registry. And WTT 2.0
]   provides one of such engines.
] - So, it's rather a matter of preference to choose the rendering engine
]   than a matter of font encoding.

K.Theppitak, let me make sure that we are not mixing up cell-clustering
and font shaping. You mentioned that it's a matter of preference to
choose the rendering engine, to have either Wtt2.0 cell-clustering or
(let's say) non-Wtt2.0 cell-clustering. How would you be able to do that
with Pango?

For the long term, we hope a new font technology will let us put this
cell-clustering information into the font, and if that is possible, then
we won't need to put .wtt2.0 into the XLFD. For now, we are talking about
the short term, and about how the Thai Pango engine can provide the
cell-clustering that users need.


] 
] Now regarding the Mac and Windows variations:
] - In naive implementations, Thai tone marks are always placed at the
]   highest level, above the upper vowels, to prevent collapsing.
] - This can safely render readable Thai texts. However, it looks
]   typographically bad in the cells that are composed of only a base line
]   character and a tone mark, without the upper vowel.
] - Elegant renderers therefore put the tone mark down at the position for 
]   upper vowel.
] - In the absence of intelligent rendering mechanism, additional code
]   points are then designed to represent the lowered tone marks.
] - Another case is the problem with base characters with upshooting tail.
]   The upper vowels and tone marks have to be shifted to the left to
]   provide space for the tail.
] - The result is 2 sets of upper vowel glyphs and 4 sets of tone mark
]   glyphs.
] - There are some other extensions, such as the 2 sets of lower vowels to
]   provide space for base characters with downshooting tail, and an 
]   additional set of glyph for baseless version of consonants with base.
] - This needs additional code points from the original TIS 620. And Mac and
]   Windows have designed their own extensions.

Again, K.Theppitak, the above information is about shaping, and we are
focusing on cell-clustering. I just want to make sure that we are
trying to sort out how users can select different cell-clustering
rules.

] 
] Therefore:
] - Rendering engine based on tis620-0, WTT 2.0 included, can be applied
]   with all tis620 fonts. But the result is inelegant.
] - Mac rendering engine can be applied to tis620-1 fonts only.
] - Windows rendering engine can be applied to tis620-2 fonts only.

Same as above.

] 
] However, WTT 2.0 has provided a perfect framework for Thai text I/O.
] Illegal keyboard input sequences can be filtered out at 3 levels of
] strictness. And illegal character sequences in text buffers are guaranteed
] to be noticeable to users when rendered.

K.Theppitak, please don't mix up Wtt2.0 input checking with the
cell-clustering rules, which are for output.

Pango is designed for output, and we need to assume that all kinds of
data, whether they contain illegal sequences or not, must be displayed in a
proper way. So having Wtt2.0 input checking is not relevant to having
different cell-clustering rules for output.


] 
] Therefore, it would be perfect to adopt WTT 2.0 rendering engine, with the
] extension to pick Mac or Windows alternate glyphs, according to their
] encodings, when available.

That's right. That's why we are trying to put .wtt2.0 into the XLFD, so that
the Thai Pango engine will be able to choose among various cell-clustering
rules for output and for text manipulation operations like cursor movement,
selection, copy-paste.....


] 
] In fact, Microsoft and MacIntosh were also in the Thai API Consortium
] (TAPIC) who defined WTT 2.0, and their Thai supports are based on the
] specification. However, their supports are not complete and have been 
] customized in their own ways.

That is so true, and let me say something. If they contributed to
this Wtt2.0 specification, why were they not able to implement it
correctly in their software? Isn't it weird?

] > 
] > In my opinion, then, we might have these 2 types of cell-clustering rules
] > and one has the name "wtt2.0", the other I'm not sure if we are going to
] > name it or not.
] 
] "incomplete wtt2.0" maybe. :-)

That's fine by me.


] 
] We can prepare a universal rendering engine that realizes the complete WTT
] 2.0 and embraces Mac and Windows extensions as well. The convention on
] tis620 encodings could determine which code point to use. In fact, it's
] just a matter of table lookup.

So, you also agree to have wtt2.0 cell-clustering as another choice for
users.


] 
] > From this point if we can finalize these, then, how individual cell cluster
] > is going to be shaped, that should depend on whichever fonts (plain, Mac,
] > Microsoft) are using.
] > 
] > Unfortunately, neither of them have said clearly about SaraAm case.
] > From both Thai Mac/Microsoft windows, the following is the clustering case
] > for SaraAm.
] > 
] > 	Consonant + SaraAm            ----> 1 cell clustering
] > 	Consonant + Tonemark + SaraAm ----> 1 cell clustering
] > 
] > Again, from my opinion, then, the following list should have done cell
] > clustering as shown above for SaraAm case.
] > 
] > 	-tis620-1
] > 	-tis620-2
] > 	-tis620-2.wtt2.0
] 
] To my understanding, this is due to the extensions for typographical
] elegance. SARA AM (U+0E33) occupies a single cell according to WTT 2.0.
] But in Windows and Mac extensions, it has to be broken into NIKHAHIT
] (U+0E4D) and SARA AA (U+0E32) in rendering. So, it takes the previous cell
] for its appearance. And the two cells have to be merged.

That's why, for the typographically elegant case, both SaraAm cases
are considered one cluster: they split SaraAm into 2 pieces, and it's
hard for text manipulation like selection/cursor movement/... to treat them
individually. So it seems to make sense to treat them as the same
cluster.
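
The two SaraAm cases could be checked with something like the sketch below.
The function and helper names are hypothetical, and I am assuming that
"consonant" means U+0E01..U+0E2E and "tone mark" means U+0E48..U+0E4B.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define THAI_SARA_AM 0x0E33

/* Assumed ranges: KO KAI .. HO NOKHUK, and MAI EK .. MAI CHATTAWA. */
static bool is_thai_consonant(uint32_t wc)
{
    return wc >= 0x0E01 && wc <= 0x0E2E;
}

static bool is_thai_tone(uint32_t wc)
{
    return wc >= 0x0E48 && wc <= 0x0E4B;
}

/* Returns true when text[i] is SARA AM and should be merged into the
   preceding cell, i.e. for the two cases
       Consonant + SaraAm
       Consonant + Tonemark + SaraAm
   In any other context SARA AM would keep a cell of its own. */
bool sara_am_joins_previous(const uint32_t *text, size_t i)
{
    if (text[i] != THAI_SARA_AM || i == 0)
        return false;
    if (is_thai_consonant(text[i - 1]))
        return true;
    return i >= 2 && is_thai_tone(text[i - 1])
                  && is_thai_consonant(text[i - 2]);
}
```

An engine following the Mac/Windows behaviour would call a check like this
before opening a new cell for SARA AM; a strict Wtt2.0 engine would not.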

] >  - How bad the problem is with legacy fonts without identified
] >    clustering rules.
] > 
] >    We won't be able to have Thai display correctly after we do text
] >    manipulation,
] >    like, insert, delete, copy-paste, selection, scrolling, .....
] >    To me, I won't trust to use those software because I'm not too sure if,
] >    whatever I edit Thai text files, the result will be correct as we think
] >    and as it shows.
] 
] The WTT 2.0 clustering rule is a graceful way of managing illegal cases,
] which is very likely to happen in the absence of keyboard sequence
] checking by XIM.

Please also think about other cases where the input isn't
from the keyboard and won't go through Wtt2.0 illegal-input checking.
The data might already contain illegal sequences, and the output
module should still be able to display them and show users that
there are illegal sequences in their data.

] 
] Moreover, the Mac and Windows extensions are necessary for quality
] rendering of Thai.

So is Pango.

Chookij V.


] 
] -Theppitak.
] ____________________________________________________________________
] Theppitak Karoonboonyanan
] Software and Language Engineering Laboratory, NECTEC
] http://www.links.nectec.or.th/~thep/  mailto:theppitak nectec or th
] 
] 




