Locale tags in translated XML



Hi all,

This is long and somewhat technical. Sorry. If your eyes
start to cross, feel free to ignore it.

I've been trying to figure out how to make our POSIX-style
locale tags cooperate with the BCP47 tags that are commonly
used on the Internet and in XML. It's bothered me for a long
time that we put non-BCP47 tags in our xml:lang attributes.

Besides being against recommended practice, it also means
that the XPath lang() function doesn't work as well as it
ought to. This could come into play with the new Mallard
conditional processing work.
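
To illustrate the lang() problem, here is a rough Python sketch of
the matching rule from the XPath 1.0 spec: the xml:lang value has to
equal the argument, or start with the argument followed by a hyphen,
ignoring case. The function name is made up for this example, and it
only looks at a single attribute value rather than walking up the
ancestors the way a real XPath engine does.

```python
def xpath_lang_matches(xml_lang: str, wanted: str) -> bool:
    """Approximate the XPath 1.0 lang() test for one xml:lang value."""
    have = xml_lang.lower()
    want = wanted.lower()
    # Match on equality, or on a prefix that is followed by "-".
    return have == want or have.startswith(want + "-")


print(xpath_lang_matches("sr-Latn", "sr"))   # True
print(xpath_lang_matches("sr@latin", "sr"))  # False: "@" is not "-"
```

So a test like lang('sr') never sees a document tagged sr@latin as
Serbian, even though it is.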

Also, with my recent work on itstool, I don't really want
to have it output non-standard XML.

What I'd like to do is have us always use BCP47 locale tags
in the xml:lang attribute in our XML documents. They would
still be managed by PO files named according to our locales,
and we would still install them according to our locales.
For example, we'd still have sr@latin.po, and the XML would
still go to /usr/share/gnome/help/.../sr@latin/, but the
xml:lang attribute on the actual XML files would be

  xml:lang="sr-Latn"

I'll convert these automatically when generating the XML.
I've detailed the conversion below.

This also means the XSLT internationalization in Yelp has
to change. Right now, it expects POSIX-style locales, and
it falls back to base languages by trying all combinations
of the parts of the tag in a specified order. (I think I
copied that order from GLib.)

So, I would have to make the i18n machinery in yelp-xsl
expect BCP47 locales, because it matches on the document's
language, not your locale. What I would do is treat any
locale as a list of tokens separated by non-alphanumeric
characters, then simply chain up by removing trailing
parts.
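
A rough sketch of that chaining, in Python purely for illustration
(the real change would live in the yelp-xsl stylesheets). The
function name and the choice to lowercase tokens for comparison are
my assumptions, not anything Yelp does today.

```python
import re
from typing import List, Tuple


def fallback_chain(tag: str) -> List[Tuple[str, ...]]:
    """Split a tag on non-alphanumerics, then drop trailing tokens."""
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", tag) if t]
    # Longest candidate first, ending with the bare language.
    return [tuple(tokens[:n]) for n in range(len(tokens), 0, -1)]


print(fallback_chain("pt-BR"))
# [('pt', 'br'), ('pt',)]
print(fallback_chain("sr_RS@latin"))
# [('sr', 'rs', 'latin'), ('sr', 'rs'), ('sr',)]
```

Because the separators are thrown away, the same routine accepts
both hyphenated and POSIX-style input, and candidates are compared
token by token.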

There's a slight loss of functionality here. Currently,
for a document declared as sr_RS@latin, Yelp can use the
localizations for sr@latin (as well as sr_RS or just sr).
Under the new scheme, it can still find translations in
sr_RS and sr, but not sr@latin.

Practically speaking, we don't have translations that use
sr_RS@latin. Sure, it might be your actual LANG on your
computer, but we don't have PO files for it. So I don't
think we're actually fully exercising Yelp's capabilities
with fallback languages.

Translators would still manage the yelp-xsl translations
as they do now. I'd handle the locale tag conversion at
build time when generating the XML string catalog.


Converting Tags
===============

Our locale tags take the form:

  ll_RR.charset@variant

Where ll is the primary language, RR is a region (country),
variant is some sort of variant, and charset is the character
encoding to use.

BCP47 locale tags look generally like:

  language-script-region-variant

I will convert from POSIX-style locales to BCP47 as follows:

The charset will be dropped. It's not relevant for what I'm
doing here. The language and region will be copied into their
correct places. Note that in BCP47, region comes after script.
The variant will be converted on a case-by-case basis:

@cyrillic will be changed to Cyrl and used as script.
@devanagari will be changed to Deva and used as script.
@euro will be dropped.
@ije will be used as variant.
@latin will be changed to Latn and used as script.
@shaw will be changed to Shaw and used as script.
@valencia will be used as variant.

Looking through locale -a, that leaves @abegede, @iqtelif,
and @saaho. I haven't looked into what these are yet. Any
variants I don't special-case will be used as the variant
in the BCP47 tag.

Examples:
  ca@valencia       ->  ca-valencia
  en@shaw           ->  en-Shaw
  ks_IN@devanagari  ->  ks-Deva-IN
  sr@ije            ->  sr-ije
  sr_RS@latin       ->  sr-Latn-RS
  uz_UZ@cyrillic    ->  uz-Cyrl-UZ
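
The rules above can be sketched roughly like this. It's Python for
illustration only; the names posix_to_bcp47, SCRIPT_VARIANTS and
DROPPED_VARIANTS are invented for the example, it tolerates the
charset appearing either before or after the @variant, and it does
no validation against the BCP47 grammar.

```python
# Variants that become a BCP47 script subtag.
SCRIPT_VARIANTS = {
    "cyrillic": "Cyrl",
    "devanagari": "Deva",
    "latin": "Latn",
    "shaw": "Shaw",
}
# Variants that are simply dropped.
DROPPED_VARIANTS = {"euro"}


def posix_to_bcp47(locale: str) -> str:
    """Convert a POSIX-style locale tag to a BCP47-style tag."""
    base, _, variant = locale.partition("@")
    base = base.split(".", 1)[0]        # drop a charset after the region
    variant = variant.split(".", 1)[0]  # or one after the variant
    lang, _, region = base.partition("_")

    parts = [lang.lower()]
    script = SCRIPT_VARIANTS.get(variant.lower())
    if script:
        parts.append(script)            # script comes before region
    if region:
        parts.append(region.upper())
    if variant and not script and variant.lower() not in DROPPED_VARIANTS:
        parts.append(variant.lower())   # anything else stays a variant
    return "-".join(parts)


for tag in ("ca@valencia", "ks_IN@devanagari", "sr_RS@latin"):
    print(tag, "->", posix_to_bcp47(tag))
```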

Converting back from BCP47 to POSIX is harder. Luckily,
I don't think we need to do it.

Comments?

Thanks,
Shaun



