Re: Concerning Keyboard Status Menu
- From: Mike Qin <mikeandmore gmail com>
- To: Debarshi Ray <rishi is lostca se>
- Cc: desktop-devel-list gnome org
- Subject: Re: Concerning Keyboard Status Menu
- Date: Sat, 24 Nov 2012 19:48:49 -0500
On 24/11/12 06:40 PM, Debarshi Ray wrote:
The thing about Chinese input methods is: few of them do a good job.
Chinese styling, dialects, and the idioms of modern Chinese culture all *vary*.
Even the big commercial input methods fail to do a good job on
every aspect mentioned above. That's why you see several commercial
input methods installed even on a single-user desktop. This is why input
methods tend to be inconsistent.
The default pinyin input method GNOME whitelisted is ibus-pinyin. It's a very
basic input engine that does a relatively poor job on almost every
aspect I mentioned above. And I don't mean to offend those
developers; Sunpinyin is no better than that.
Developing a Chinese IME is *extremely* hard and it has commercial
barriers. Big search engine companies have much more complete training
datasets than any opensource organization; commercial dictionaries from
Chinese internet media companies cover every aspect of Chinese
culture: ancient poetry, modern words, idioms... Companies like Microsoft
and Google have much more sophisticated Machine Learning research
groups than any opensource organization...
The question is, if it is so hard to develop a Chinese IME, then why not
join together to improve it instead of having lots of half-finished ones?
If we are so low on resources then we should try to avoid fragmentation,
shouldn't we?
Good question! As a 20-year Chinese native speaker, I would say that's
impossible. This has never happened in the commercial input method world,
and this is never going to happen in the opensource world either.
The situation of the Chinese language, as well as of input methods, is extremely
complex. The workload of a complete universal input engine is incredibly huge!
First, no one really knows how to speak "Chinese". There are too many
dialects. For instance, my girlfriend is from Zhejiang, and there is a
new dialect every 10km. Yes, these are distinct dialects; people who speak
different dialects *cannot* understand each other. Some of these
dialects have their own characters, say Cantonese, and some of them cannot
even be fully expressed in Han characters. (That's why the Han character
standard has been extended several times.)
Second, the ways of inputting Chinese are very different. Pinyin is one; it
basically encodes the way Chinese is read. Besides Pinyin, even I (who
always failed my Chinese exams) know of at least Wubi, Shuangpin, Erbi, and
Zhengma. Each of them is complex enough to need an individual engine.
Third, just to pick Mandarin Pinyin as an example: because Han characters
are not letter-based, the problem of an input method is basically the same
as Speech Recognition. Several sub-problems of this topic are wide
open, for instance natural language segmentation, dictionary mining,
context inference... These problems are so open that no engine developer
is sure that their way is the best way. In fact, we all encourage each
other to try new approaches, because the current UX of opensource input
methods is still way behind the commercial ones that we use on Windows.
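To make the segmentation sub-problem concrete, here is a minimal Python sketch. It is only an illustration, not code from any of the engines discussed: the three-entry syllable set and the function name `segmentations` are made up for the example. It shows that even before any character is chosen, the typed letters "xian" can already be split in two valid ways.

```python
# Toy syllable inventory (an assumption for this example); a real engine
# would carry the full table of valid pinyin syllables.
SYLLABLES = {"xi", "an", "xian"}


def segmentations(text, syllables=SYLLABLES):
    """Return every way to split `text` into known syllables."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            # Keep this prefix and split the remainder recursively.
            for rest in segmentations(text[end:], syllables):
                results.append([head] + rest)
    return results


print(segmentations("xian"))
```

With this toy table the call yields both `['xi', 'an']` and `['xian']`; deciding which reading the user meant is where dictionaries and context come in.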
Fourth, the patent issue. As I mentioned in the first email, patents
discourage open source input methods from using commercial dictionaries,
because these dictionaries are either collected manually or mined with
sophisticated Machine Learning techniques from massive datasets that
we don't have.
As a result, there is no "universal" input engine for Chinese. But each
of the engines has its own uniqueness. Take Mandarin Pinyin as an example:
* ibus-pinyin tends to be simple to hack, but provides poor UX since it
does not consider language context. It's under the GPL license.
* sunpinyin is more sophisticated; it uses 3-grams to overcome the
Non-Markov property of Chinese. But the dictionaries and the
datasets are still a problem. And the LGPL license and its history of
originating from Sun Microsystems scared a lot of package maintainers away.
* libpinyin is considered the successor of sunpinyin, but is under heavy
development. It's still considered unstable now.
* rime sounds different; they seem to target people who really
appreciate the beauty of ancient Chinese. (Correct me if I'm wrong, of
course.)
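The difference between ignoring and using language context can be sketched in a few lines of Python. This is a toy under stated assumptions, not how ibus-pinyin or sunpinyin are actually implemented: the candidate table and all counts are invented, the names `decode_context_free` and `decode_bigram` are hypothetical, and it uses 2-grams with add-one smoothing rather than 3-grams to keep the example short.

```python
import math

# Invented toy data: candidate characters per syllable and made-up counts.
CANDIDATES = {"gong": ["工", "公"], "shi": ["是", "式"]}
UNIGRAM = {"工": 12, "公": 10, "是": 20, "式": 8}
BIGRAM = {("公", "式"): 8}  # "公式" (formula) seen as a pair


def decode_context_free(syllables):
    """Pick the most frequent character for each syllable on its own."""
    return "".join(max(CANDIDATES[s], key=UNIGRAM.get) for s in syllables)


def decode_bigram(syllables):
    """Viterbi search over a bigram model with add-one smoothing."""
    if not syllables:
        return ""
    total = sum(UNIGRAM.values())
    vocab = len(UNIGRAM)
    # best[ch] = (log score, text) of the best path ending in ch
    best = {ch: (math.log(UNIGRAM[ch] / total), ch)
            for ch in CANDIDATES[syllables[0]]}
    for syl in syllables[1:]:
        step = {}
        for cur in CANDIDATES[syl]:
            step[cur] = max(
                (score + math.log((BIGRAM.get((prev, cur), 0) + 1)
                                  / (UNIGRAM[prev] + vocab)),
                 text + cur)
                for prev, (score, text) in best.items()
            )
        best = step
    return max(best.values())[1]


print(decode_context_free(["gong", "shi"]))  # 工是
print(decode_bigram(["gong", "shi"]))        # 公式
```

With these invented counts, the context-free decoder glues together the two individually most frequent characters, while the bigram decoder recovers the word the pair count supports. The quality of such a model depends entirely on the counts, which is the dictionary and dataset problem again.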
As I said, each approach is a complete approach. They're *not
fragmented*. We're not sure which one is the good idea; we're still
trying to see which one is better. It feels pretty much like research:
we all know every current approach sucks, and we're exploring different
ways to make it better. If you focus on just one of them, we lose the whole
opportunity to make it better.
Cheers,
Debarshi
--
Thanks
Mike