Re: Components we need

From: Roozbeh Pournader <roozbeh farsiweb info>
To: Bruno Haible <bruno clisp org>
Cc: GNOME Locale mailing list <locale-list gnome org>
Subject: Re: Components we need
Date: Tue, 16 Aug 2005 19:18:21 +0430

On Tue, 2005-08-16 at 14:53 +0200, Bruno Haible wrote:

> Therefore, for items which are purely string -> string mappings, the
> obvious choice is GNU .mo format. Only for other kinds of data (like
> integers, numeric data, bit tables, arrays of strings etc.) we need
> to invent an ad-hoc binary locale data format.

Well, I'm not sure how we want to do some of the CLDR things, but CLDR
itself treats most of the data as string->string mappings, using XPath
as the key. See Appendix I: Inheritance and Validity:

http://www.unicode.org/reports/tr35/#Inheritance_and_Validity
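To make the string->string view concrete, here is a minimal sketch of a lookup that falls back along the locale inheritance chain. The keys, values, and the helper names `parent` and `lookup` are all made up for illustration, and the truncate-at-underscore rule is a simplification, not a quote from the spec.

```python
# Hypothetical locale data: each locale maps XPath-like keys to strings.
# Keys and values are invented for this sketch.
LOCALES = {
    "root": {"//ldml/x/a": "root-a", "//ldml/x/b": "root-b"},
    "fa": {"//ldml/x/a": "fa-a"},
    "fa_IR": {},
}


def parent(locale):
    """Return the parent locale by truncating at the last underscore."""
    if locale == "root":
        return None
    head, sep, _ = locale.rpartition("_")
    return head if sep else "root"


def lookup(locale, key):
    """Walk the inheritance chain until some locale defines the key."""
    while locale is not None:
        value = LOCALES.get(locale, {}).get(key)
        if value is not None:
            return value
        locale = parent(locale)
    raise KeyError(key)
```

With this shape, each locale is just a flat string table, which is what would make a .mo-like format a plausible fit for this part of the data.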

There are three kinds of data that don't fit this model:

1) Aliases: These are relative XPath pointers into this or another tree.
We should probably expand these at "compile time".

BTW, this is not really clean XPath, since its base or destination may
not exist at all. For example, somewhere one may have an alias pointing
from a node in fa_IR named "//ldml/x/y" to "../z", with fa_IR having no
<z> node and "fa" (where the "../z" would resolve to, based on
inheritance) having no <y> node.

2) Attribute-information elements: These are when there is no string as
the value, but some data in XML attributes. These are:

  a) <default>: This marks the default choice among alternatives. For
  calendars, it may be gregorian, etc. For date format strings, it may
  point to any one of them, etc.

  b) <firstDay>: values are: sun, mon, tue, wed, thu, fri, sat.

  c) <mapping>: This takes a registry and a charset name. The registry
  is currently fixed to be "iana". So this is really a charset name
  currently.

  d) <measurementSystem>: values are metric, US, UK.

  e) <minDays>: a number from 1 to 7.

  f) <orientation>: this takes two directions, one for characters and
  another for lines. Values are left-to-right, right-to-left,
  top-to-bottom, bottom-to-top.

  g) <settings>: this carries a list of attributes whose values are
  mostly on/off, but which in some cases take five different values.
  This should probably be separated into its parts.

  h) <weekendStart> and <weekendEnd>: These take a weekday (like
  <firstDay>) and a time of day, which ranges from "00:00" to "24:00",
  inclusive. The spec doesn't go into detail on whether seconds or
  subseconds may be used, but I assume that seconds may.

3) Collation info: all that appears in a <collation> element should
probably be considered an atomic lump of data. I don't know if we should
do as glibc does and keep the instructions as applied to the default UCA
collation data (CPU efficient), or keep the data only (memory efficient
for non-Han locales).

Some of our potential users (like libxslt) would also need to set some
of the switches in collation settings, like the uppercase-before-
lowercase mode. Not having gone through all of UCA, I don't know if we
need the original rules to do this.

But I guess we would be much more memory efficient if we don't stick to
the XPath key model.
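As a rough sketch of what a non-XPath representation could look like for some of the attribute-information elements in (2), the week-related items can be packed into a small typed record. The names `WeekData`, `make_week_data`, and `parse_time_of_day` are hypothetical, and accepting an optional seconds field is only the assumption stated above, not something the spec confirms.

```python
from dataclasses import dataclass

WEEKDAYS = ("sun", "mon", "tue", "wed", "thu", "fri", "sat")


def parse_time_of_day(text):
    """Convert "HH:MM" or "HH:MM:SS" to seconds since midnight (0..86400)."""
    parts = [int(p) for p in text.split(":")]
    if len(parts) not in (2, 3):
        raise ValueError(text)
    hours, minutes = parts[0], parts[1]
    seconds = parts[2] if len(parts) == 3 else 0
    total = hours * 3600 + minutes * 60 + seconds
    # "24:00" is allowed (inclusive upper bound), anything later is not.
    if not (0 <= minutes < 60 and 0 <= seconds < 60 and 0 <= total <= 86400):
        raise ValueError(text)
    return total


@dataclass(frozen=True)
class WeekData:
    first_day: int        # index into WEEKDAYS
    min_days: int         # 1..7
    weekend_start: tuple  # (weekday index, seconds since midnight)
    weekend_end: tuple    # (weekday index, seconds since midnight)


def make_week_data(first_day, min_days, ws_day, ws_time, we_day, we_time):
    """Build a WeekData record from the attribute strings."""
    if not 1 <= min_days <= 7:
        raise ValueError(min_days)
    return WeekData(
        WEEKDAYS.index(first_day),
        min_days,
        (WEEKDAYS.index(ws_day), parse_time_of_day(ws_time)),
        (WEEKDAYS.index(we_day), parse_time_of_day(we_time)),
    )
```

A record like this is a handful of small integers per locale, compared with several key strings plus value strings in the XPath-keyed form.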

Then, there is also the supplemental data, which is the same across
locales, and would not take much memory anyway. It's not strings at
all.
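Going back to the aliases in (1), expanding them at "compile time" could look roughly like the sketch below. It mirrors the fa_IR example: the alias lives in fa_IR, its target only exists in "fa", so resolution has to happen on the merged (inherited) view. The data layout, the `("alias", path)` marker, and all function names are assumptions for illustration. Real LDML aliases carry more than a bare relative path, and chained aliases are not handled here.

```python
import posixpath

# Hypothetical flattened data: key -> string, or ("alias", relative_path).
DATA = {
    "root": {},
    "fa": {"//ldml/x/z/name": "zed"},
    "fa_IR": {"//ldml/x/y": ("alias", "../z")},
}
CHAIN = {"fa_IR": "fa", "fa": "root", "root": None}


def merged(locale):
    """Flatten the inheritance chain; child entries override parents."""
    chain = []
    while locale is not None:
        chain.append(locale)
        locale = CHAIN[locale]
    result = {}
    for name in reversed(chain):
        result.update(DATA[name])
    return result


def resolve(base, relative):
    """Resolve a relative pointer such as "../z" against an absolute key."""
    # Keys start with "//"; strip it so posixpath treats the rest as a path.
    return "//" + posixpath.normpath(posixpath.join(base[2:], relative))


def expand_aliases(locale):
    """Return the merged view with each alias replaced by its target's entries."""
    view = merged(locale)
    out = {k: v for k, v in view.items() if not isinstance(v, tuple)}
    for key, value in view.items():
        if isinstance(value, tuple):
            target = resolve(key, value[1])
            for k, v in view.items():
                in_target = k == target or k.startswith(target + "/")
                if in_target and not isinstance(v, tuple):
                    # Re-root the target's entry under the alias node.
                    out[key + k[len(target):]] = v
    return out
```

After expansion the runtime only ever sees plain string entries, at the cost of duplicating the aliased subtree in the compiled output.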

roozbeh

Follow-Ups:
- Re: data types and inheritance
  - From: Bruno Haible

References:
- Components we need
  - From: Roozbeh Pournader
- Re: Components we need
  - From: Bruno Haible
- Re: Components we need
  - From: Roozbeh Pournader
- Re: Components we need
  - From: Bruno Haible
