Re: Language tags

Owen Taylor wrote:
> The last major API bug I'm planning to deal with for Pango before 1.0
> is to clean up the language tag handling.
> The idea of a language tag is that it identifies the language
> of a portion of text.
> The primary use of language tags in Pango is for taking language into
> account when shaping - for instance, to choose between simplified and
> traditional Chinese character shapes.
> [...]
> Looking around a bit, the solution to 1) was pretty easily to
> hand - a specification for language tags can be found in RFC 3066;
> these language tags, which are of the form:
>  en-us
>  zh-hakka
>  zh-min-nan
> Are used for http, html, xml, mail, and are the form of language tag
> currently supported by the Unicode plane 14 language tag
> characters. So, using these tags will give wide application
> compatibility. They also are compatible with the rough idea of 'en_US'
> being a language tag - if the first component of a language tag is 2
> characters, it must be an ISO 639-1 language code; if the second
> component of the tag is 2 characters, it must be an ISO 3166 country
> code.
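
To make the quoted rule concrete, here is a minimal sketch in Python. It is
not Pango code. The function name `describe_language_tag` and the shape of
the returned dictionary are made up for illustration, and the check looks
only at component lengths, not at whether a code is actually registered.

```python
def describe_language_tag(tag):
    """Split an RFC 3066 style tag on '-' and label its first two components.

    Hypothetical helper: only the two length rules quoted above are applied.
    """
    subtags = tag.lower().split("-")
    primary = subtags[0]
    second = subtags[1] if len(subtags) > 1 else None
    return {
        "subtags": subtags,
        # A 2-character first component must be an ISO 639-1 language code.
        "language": primary if len(primary) == 2 else None,
        # A 2-character second component must be an ISO 3166 country code.
        "country": second if second is not None and len(second) == 2 else None,
    }
```

Under these assumptions, "en-us" yields a language and a country, while
"zh-hakka" and "zh-min-nan" yield a language only, since their second
components are longer than 2 characters.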

These two sections don't seem to tie together too well. zh-hakka and
zh-min-nan are purely issues of spoken Chinese, and have no bearing on
how it is written. The information you said you would need at rendering
time - simplified versus traditional - is another issue entirely. Using
these language tags does not, therefore, seem to provide the information
you need.

Having said this, I'm not sure it matters for Unihan. I am not aware of
any instance where the look of a Unihan character is dependent on
whether the text is simplified or traditional Chinese. This was an issue
in early Unicode, where there was some merging. In Unicode 3 all
traditional and simplified characters co-exist, with separate code
points. Text may freely mix the two, which it often does in real life
(e.g. a simplified Chinese text containing a Hong Kong company name
would normally have that name rendered in its traditional form - at
least in typesetting, where such free intermixing has worked well for
years). If you want to try mixing existing fonts in the rendered output,
there may be some issues, but these have nothing to do with the language.

There would be issues with knowing if the text is Chinese or Japanese
(and presumably with Korean, but that's beyond my knowledge), but that
would be OK - any zh-* renders one way, and any ja-* renders another
(the zh-* part I know is true, the ja-* is an assumption on my part).
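
A minimal sketch of that idea in Python, assuming the renderer only needs
the first component of the tag. The function name `han_rendering_style` and
the style labels are hypothetical, and the "ja" entry rests on the same
assumption as above. Korean is left out because I can't say how it should
be handled.

```python
def han_rendering_style(tag):
    """Pick a Han glyph style from the first component of a language tag.

    Hypothetical helper: any zh-* tag maps to one style and any ja-* tag
    to another; everything else falls back to a default.
    """
    primary = tag.lower().split("-")[0]
    styles = {
        "zh": "chinese",   # any zh-* renders one way
        "ja": "japanese",  # assumed: any ja-* renders another way
    }
    return styles.get(primary, "default")
```

With this, "zh-hakka" and "zh-min-nan" select the same style, which is the
point: the spoken-dialect part of the tag plays no role in rendering.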

