Re: Fw: Language tags

Owen Taylor wrote:
> I appreciate you taking the time to send your thoughts on this
> issue. It's always good to get input from a real expert in the area.
> I'm encouraged by the fact that while I only spent a few hours
> cataloging existing practice, I had managed to locate most of the
> examples mentioned by you - ISO 639 and RFC 3066 of course, but also
> ISO 15924, OpenType script/language tags, Microsoft LangID's and so forth.
> It's certainly true that the purpose of language tags in Pango does
> not exactly conform to the intent of RFC 3066; I'd describe the
> purposes of the Pango tagging as:
>  Providing any information about the language of the source text that
>  would be useful for the processes of displaying and editing the text
>  in written form.  These processes include hyphenation, boundary
>  resolution, and font and glyph selection.
> There are certainly variants of practice in these areas that have no
> correspondence to spoken language. But I believe that RFC 3066
> language tags are close enough in intent and form to be quite useful
> for this purpose - certainly closer than ISO 15924 script tags.

I wouldn't agree with that, for reasons I explained a few days ago. For
Chinese at least (the only place I am qualified to comment) these tags
are unrelated to the script, and only closely related to spoken
language. However...

> And practically speaking, language information from higher level
> protocols (HTTP, mail, etc), will most frequently be in RFC 3066 form,
> so anything else would be quite inconvenient for applications.

I would certainly agree with that. Whatever form of tagging you use
within Pango, the only tagging you are likely to receive from higher
levels (at least in the near term) is RFC 3066 or similar. For practical
reasons you have no real choice but to work with RFC 3066 tags.

> Since the form I'm proposing for Pango script tags is to use the RFC
> 3066 form, with arbitrary numbers of subcomponents, and no
> interpretation of the subcomponents, I believe it should be no problem
> to accommodate future extensions. If the use of multiple different tags
> for the same language becomes frequent, then an aliasing mechanism
> might be necessary, but that should be easy to add.
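
A rough sketch of the scheme quoted above, in Python rather than Pango's
own C, may make it concrete. It treats a tag as an opaque, case-insensitive
list of hyphen-separated subcomponents and applies an optional alias table
before comparing. The function names and the alias entries are assumptions
made for illustration only; nothing here is taken from Pango.

```python
# Hypothetical alias table mapping alternative tags to one preferred tag.
# The entries are illustrative only and are not taken from Pango.
ALIASES = {"iw": "he", "in": "id"}


def canonical_tag(tag: str) -> str:
    """Lower-case the tag (RFC 3066 tags are case-insensitive) and apply aliases."""
    tag = tag.strip().lower()
    return ALIASES.get(tag, tag)


def subcomponents(tag: str) -> tuple[str, ...]:
    """Split a tag into subcomponents without interpreting any of them."""
    return tuple(canonical_tag(tag).split("-"))


def tag_matches(tag: str, language_range: str) -> bool:
    """True if language_range equals tag or is a whole-subcomponent prefix of it."""
    parts = subcomponents(tag)
    prefix = subcomponents(language_range)
    return parts[:len(prefix)] == prefix
```

Because no subcomponent is interpreted, a tag with any number of parts
passes through unchanged, and a new alias is just one more table entry.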

One thing Peter Constable said intrigued me, though. He described the
plane 14 Unicode tags as a Bad Idea. From what I have seen they are not
bad, just inappropriate. In all the cases I understand, they seem to be a
reasonable indicator of spoken language, and would therefore be
appropriate tags for speech processing. They just don't cut it for
script processing. If that is not generally the case I would be
interested to know.

