Re: Font lookup ranges [was Re: Notes on Pango Xft backend]



 --- Keith Packard <keithp keithp com> wrote:
> Around 4 o'clock on May 30, Andrew Dunbar wrote:
>
> > >  - The set of languages in the OS/2 table / FC_LANG is pitifully
> >
> > Can't you use coverage to determine this?
>
> Not easily.  Traditional Chinese, Simplified Chinese, Japanese and
> Korean fonts cover the same Unicode regions, and fonts for all of
> these languages generally cover only a fraction of the total space,
> making any coverage-based language tag only a guess at best.  In
> particular, we'd need to call upon an expert in the area of the two
> Chinese variants to get an idea if there were any codepoints
> distinguishing the two.

Well, yes and no.  Korean uses the Traditional Chinese
style, so it's safer to mix those two.  Simplified and
Traditional use mostly similar styles, and I'm not
aware of any codepoint that needs to be rendered
differently for each, as the simplified and traditional
forms have separate codepoints.  But mixing a Japanese
style with a Chinese/Korean style generally offends
somebody.  The usual example is U+6D77, which has a
different stroke count in Japanese than in the others.
Stroke count is an important property in CJK because
it is used to look characters up in dictionaries, and
people are sensitive to this.  You can see that the
Chinese versions have two "dots" in the middle of the
grid whereas the Japanese version has a vertical bar.
Apparently this is the kind of thing the Japanese
dislike about Unicode:
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=6d77
Also, the Japanese generally seem to prefer a "sans serif"
look while the Chinese prefer a more "calligraphic" or
"serifed" look, and these don't mix and match well.
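
To make the U+6D77 case concrete, here is a small Python sketch (my
own illustration, nothing Pango or fontconfig does).  It checks that
the character round-trips through the Japanese, Simplified Chinese,
Traditional Chinese and Korean legacy encodings, i.e. all four
communities use the same unified codepoint, so the codepoint alone
cannot say which glyph style is wanted.  The codec names are Python's
standard ones; the helper name round_trips is made up.

```python
import unicodedata

HAI = "\u6d77"  # the "sea" ideograph discussed above

# Python's standard codecs for four legacy national encodings.
LEGACY_CODECS = {
    "Japanese": "shift_jis",
    "Simplified Chinese": "gb2312",
    "Traditional Chinese": "big5",
    "Korean": "euc_kr",
}

def round_trips(ch, codec):
    """Return True if ch can be encoded in codec and decodes back unchanged."""
    try:
        return ch.encode(codec).decode(codec) == ch
    except UnicodeEncodeError:
        return False

# One unified codepoint, present in every legacy character set.
shared = {lang: round_trips(HAI, codec) for lang, codec in LEGACY_CODECS.items()}
print(unicodedata.name(HAI), shared)
```

Which of the regional glyph shapes you actually get is then decided
entirely by the font that is picked, which is why the language tag
matters.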

In Japan there is a set of roughly 2,000 characters
(the Joyo kanji; the older Toyo list had 1,850) that
everybody has to know by the end of high school.
There's probably an equivalent for Chinese.
Somebody knowledgeable could probably build a table
useful for heuristics, or you could do a frequency
count using web pages and make a table from that.

Anyway, I think that if we can make an educated guess
at low computational cost, it will be better than
nothing.
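
A minimal sketch of the kind of heuristic I mean, in Python.  All
names here (guess_language, COMMON_CHARS, font_coverage) are
hypothetical, and the tables are tiny toy data, not real frequency
counts.  The idea is just to score each language by the fraction of
its common-character table that the font covers and take the best.

```python
def guess_language(font_coverage, common_chars):
    """Score each language by the fraction of its table the font covers.

    font_coverage: set of characters the font has glyphs for.
    common_chars:  dict mapping a language tag to a set of common characters.
    Returns (best_language, scores).
    """
    scores = {}
    for lang, chars in common_chars.items():
        scores[lang] = len(chars & font_coverage) / len(chars) if chars else 0.0
    best = max(scores, key=scores.get)
    return best, scores

# Toy tables only: a real one would come from a school list or a
# frequency count over web pages.
COMMON_CHARS = {
    "ja": set("のはをに海日本"),
    "zh-cn": set("这们说海日本"),
    "zh-tw": set("這們說海日本"),
    "ko": set("한국어는이다"),
}

# Coverage of an imaginary Japanese font.
font_coverage = set("のはをにがで海日本語学")

best, scores = guess_language(font_coverage, COMMON_CHARS)
print(best, scores)
```

Note that the shared Han characters still give the two Chinese tables
a partial score, so the result is only a guess, and two languages with
near-equal scores should probably be treated as "unknown".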

> > For now yes.  Romanian uses a "comma below" on some letters which
> > Unicode has mapped onto a cedilla.
>
> This is a minor issue by comparison, but the same basic problem.
> We'll see if people using that language start to rise up in revolt
> as the Han language groups have, then we can start looking for yet
> another kludge.

I investigated further last night and it seems these
characters have been given separate codepoints after
all.  I'm pretty sure there will be new cases in the
Indic ranges, where Unicode recommends using codepoints
from Devanagari for various symbols in the other
scripts, but these are hardly used yet.
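
For reference, a short Python check of the Romanian letters, assuming
the codepoints in question are the S and T "with comma below" at
U+0218..U+021B (my reading; the cedilla forms are at U+015E, U+015F,
U+0162 and U+0163).  It only uses the standard unicodedata module.

```python
import unicodedata

# (cedilla form, comma-below form) for s, t, S, T
PAIRS = [
    ("\u015f", "\u0219"),
    ("\u0163", "\u021b"),
    ("\u015e", "\u0218"),
    ("\u0162", "\u021a"),
]

for cedilla, comma in PAIRS:
    print(f"U+{ord(cedilla):04X} {unicodedata.name(cedilla)}")
    print(f"U+{ord(comma):04X} {unicodedata.name(comma)}")

# The two forms also decompose to different combining marks:
# U+0327 COMBINING CEDILLA versus U+0326 COMBINING COMMA BELOW.
cedilla_mark = unicodedata.normalize("NFD", "\u015f")[1]
comma_mark = unicodedata.normalize("NFD", "\u0219")[1]
print(unicodedata.name(cedilla_mark), "/", unicodedata.name(comma_mark))
```

So a font or an input method can still get this wrong by offering only
the cedilla forms, but the encoding itself no longer forces the two
together.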

Andrew Dunbar.

> Keith Packard        XFree86 Core Team        HP Cambridge Research Lab

=====
http://linguaphile.sourceforge.net http://www.abisource.com
