Components we need



I went through the LDML specification, and it seems that we need lots of
components just to be able to handle CLDR files properly.

This is a random list again:

1) Unicode Sets: <http://www.unicode.org/reports/tr35/#Unicode_Sets>

A Unicode Set is something like a regular expression, used in the LDML
spec for defining sets of "characters". But characters here don't
necessarily mean Unicode characters: they may be anything that is
considered a character (i.e., an element of the writing system) in some
locale. For example, the Latin Serbian exemplar set is this:

   [a-p r-v z đ ć č ž š {lj} {nj} {dž}]

Where "lj", "nj", and "dž" are actually multi-character Unicode strings.

Unicode Sets are used in several places. My list may be incomplete, but
they are used at least in exemplar sets (the lists of characters used in
each locale), in currency formatting, and in collation.

Implementing this properly means supporting a kind of regular expression
syntax that is specified quite clearly, but which requires having almost
all of the information from the Unicode Character Database at hand when
parsing CLDR files. For example, a Unicode Set may be defined as
[:age=3.2:], which is all the characters that existed as of Unicode 3.2,
or [:jg=Beh:], which is all the Arabic letters in the same joining group
as Beh.

We would also need an index from Unicode character names to their code
points, since Unicode Sets also allow syntax like \N{EURO SIGN} to
refer to U+20AC.
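
Just to give an idea of the interface we would need (and since we may
end up borrowing from ICU anyway), here is a minimal sketch using
ICU4C's uset.h C API with the Serbian exemplar set quoted above; an
implementation of our own would have to provide something equivalent:

    #include <stdio.h>
    #include <unicode/uset.h>
    #include <unicode/ustring.h>

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        UChar pattern[128];
        UChar lj[3];

        /* The Latin Serbian exemplar set; u_unescape() turns the \uXXXX
         * escapes into real UChars before the pattern is parsed. */
        u_unescape("[a-p r-v z \\u0111 \\u0107 \\u010D \\u017E \\u0161"
                   " {lj} {nj} {d\\u017E}]", pattern, 128);

        USet *set = uset_openPattern(pattern, -1, &status);
        if (U_FAILURE(status)) {
            fprintf(stderr, "parse error: %s\n", u_errorName(status));
            return 1;
        }

        /* Both single code points and multi-character "characters"
         * (like the {lj} digraph) are members of the set. */
        u_uastrcpy(lj, "lj");
        printf("contains 'a':  %d\n", uset_contains(set, 0x0061));
        printf("contains 'q':  %d\n", uset_contains(set, 0x0071));
        printf("contains 'lj': %d\n", uset_containsString(set, lj, -1));

        uset_close(set);
        return 0;
    }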

2) Number Format Patterns:
<http://www.unicode.org/reports/tr35/#Number_Format_Patterns>

These are used in formatting numbers and currencies, and seem to come
from Java. They include features like rounding to the nearest 0.65,
which involves half-even rounding.

They are not very clean to implement, but we could probably borrow a lot
of code from ICU for these.
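
To show the kind of arithmetic the rounding increment feature involves,
here is a small sketch of rounding to the nearest 0.65 with half-even
tie-breaking; a real implementation would have to work on the decimal
digits of the number rather than on binary doubles:

    #include <math.h>
    #include <stdio.h>

    /* Round 'value' to the nearest multiple of 'increment', breaking
     * ties toward the even multiple (half-even rounding).  Sketch
     * only: doing this on doubles is subject to binary floating-point
     * representation errors. */
    static double round_to_increment_half_even(double value, double increment)
    {
        double quotient = value / increment;
        double floor_q = floor(quotient);
        double frac = quotient - floor_q;
        double rounded;

        if (frac > 0.5)
            rounded = floor_q + 1.0;
        else if (frac < 0.5)
            rounded = floor_q;
        else /* exactly halfway: pick the even neighbour */
            rounded = (fmod(floor_q, 2.0) == 0.0) ? floor_q : floor_q + 1.0;

        return rounded * increment;
    }

    int main(void)
    {
        printf("%.2f\n", round_to_increment_half_even(1.00, 0.65)); /* 1.30 */
        printf("%.2f\n", round_to_increment_half_even(0.90, 0.65)); /* 0.65 */
        return 0;
    }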

3) Unicode Collation Algorithm (UCA), its default data file (DUCET) and
related algorithms used in ICU customizations:
<http://unicode.org/reports/tr10/>

This is needed to support anything about collation. Some of the
customization requirements are hard to make sense of on their own and
are really references to ICU behavior, so we will probably need to go
through the ICU code a lot for this.

For collation to work according to the UCA, we also need to support
normalization (perhaps only to FCD, a looser condition related to NFC
and NFD that is still enough for canonically equivalent strings to be
treated equally by the UCA).

It is also noteworthy that having only the latest version of the DUCET
would not suffice, since collation rules may request a specific DUCET
version through a <version> tag. To honor that properly, i.e. to collate
according to the requested version of Unicode and the UCA, we may also
need to keep older copies of the Unicode data around, but I'm not really
sure about that.
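
As a rough illustration of what the UCA does (and nothing close to a
real implementation: no DUCET, no contractions or expansions, no
normalization), here is a toy sketch with made-up weights that builds
three-level sort keys and compares them:

    #include <stdio.h>
    #include <string.h>

    /* Toy collation-element table; the real thing is the DUCET plus
     * locale tailorings.  The weights here are made up. */
    struct element { char ch; unsigned short primary, secondary, tertiary; };

    static const struct element table[] = {
        { 'a', 0x1000, 0x0020, 0x0002 },
        { 'A', 0x1000, 0x0020, 0x0008 },  /* differs from 'a' only at the tertiary (case) level */
        { 'b', 0x1001, 0x0020, 0x0002 },
    };

    static const struct element *lookup(char ch)
    {
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (table[i].ch == ch)
                return &table[i];
        return NULL;
    }

    /* Build a sort key: all primary weights, a separator, all secondary
     * weights, a separator, then all tertiary weights.  Real
     * implementations serialize the key to bytes so that memcmp()
     * gives the right answer. */
    static size_t sort_key(const char *s, unsigned short *key, size_t max)
    {
        size_t n = 0;
        for (int level = 0; level < 3; level++) {
            for (const char *p = s; *p; p++) {
                const struct element *e = lookup(*p);
                if (!e)
                    continue;
                unsigned short w = level == 0 ? e->primary
                                 : level == 1 ? e->secondary
                                              : e->tertiary;
                if (w && n < max)
                    key[n++] = w;
            }
            if (n < max)
                key[n++] = 0;  /* level separator */
        }
        return n;
    }

    static int compare_keys(const unsigned short *a, size_t na,
                            const unsigned short *b, size_t nb)
    {
        size_t n = na < nb ? na : nb;
        for (size_t i = 0; i < n; i++)
            if (a[i] != b[i])
                return a[i] < b[i] ? -1 : 1;
        return na < nb ? -1 : na > nb ? 1 : 0;
    }

    int main(void)
    {
        unsigned short k1[64], k2[64];
        size_t n1 = sort_key("ab", k1, 64);
        size_t n2 = sort_key("Ab", k2, 64);
        /* "ab" and "Ab" only differ at the tertiary level. */
        printf("compare(\"ab\", \"Ab\") = %d\n", compare_keys(k1, n1, k2, n2));
        return 0;
    }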

4) Alternate calendar handling code:

To handle any of the alternate calendars, we need code to support date
conversion and related tasks for different calendars. My communications
with the authors of the book Calendrical Calculations, which is the
reference for CLDR (and the best book available on the matter), trying
to convince them to release some of their code/algorithms under a free
software license (or to allow work based on them to be released as free
software), have been unsuccessful. They mentioned an unfruitful
discussion with RMS about this, but their main point was that they get
licensing fees for the use of their algorithms in proprietary software,
which would decrease if someone released the code as free software.

Some of the authors' older code is available in Emacs under the GPL. I
don't know whether we could get permission from the FSF to release
derivative work based on that under the LGPL, but we may try anyway.

The quality of the alternate calendar code available otherwise as free
software is usually poor.
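
For illustration, the usual approach in the calendrical literature is to
convert each calendar to and from a fixed day number (days counted from
Gregorian 1 January of year 1, "Rata Die"), so that converting between
any two calendars only needs a to-fixed/from-fixed pair per calendar. A
sketch of the Gregorian-to-fixed direction:

    #include <stdio.h>

    /* A year is a Gregorian leap year if it is divisible by 4, except
     * for century years, which must be divisible by 400. */
    static int gregorian_leap_year(int year)
    {
        return year % 4 == 0 && (year % 100 != 0 || year % 400 == 0);
    }

    /* Day number with day 1 = Gregorian 1 January of year 1.  Only
     * valid for year >= 1 here, since C integer division truncates
     * toward zero while the formula needs floor. */
    static long gregorian_to_fixed(int year, int month, int day)
    {
        long y = year - 1;
        long fixed = 365L * y + y / 4 - y / 100 + y / 400
                   + (367L * month - 362) / 12
                   + day;
        if (month > 2)
            fixed += gregorian_leap_year(year) ? -1 : -2;
        return fixed;
    }

    int main(void)
    {
        printf("R.D. of 2000-01-01: %ld\n", gregorian_to_fixed(2000, 1, 1));
        return 0;
    }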

5) Locale-dependent Algorithms for case-insensitive comparison,
uppercasing, lowercasing, and titlecasing:

This is needed in many places where something is case insensitive. For
example, exemplar sets are case insensitive, and while [a-z] is
specified in the file, an API should return an equivalent of [A-Za-z]
when asked for the data.

The casing algorithms and data are different for Turkish, Azerbaijani
(as written in the Latin script), and Lithuanian. I believe there is
code taking care of this in glib.
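
To show the kind of special case involved, here is a sketch of
locale-sensitive lowercasing of a single character, handling only the
Turkish/Azerbaijani dotted and dotless i; the Lithuanian rules (which
involve a combining dot above) are more complicated than this:

    #include <stdio.h>
    #include <string.h>

    /* In Turkish and Latin-script Azerbaijani, LATIN CAPITAL LETTER I
     * lowercases to the dotless U+0131, and LATIN CAPITAL LETTER I
     * WITH DOT ABOVE (U+0130) lowercases to the ordinary dotted 'i'. */
    static unsigned int to_lower(unsigned int c, const char *locale)
    {
        int turkic = strncmp(locale, "tr", 2) == 0 ||
                     strncmp(locale, "az", 2) == 0;

        if (turkic && c == 'I')
            return 0x0131;           /* LATIN SMALL LETTER DOTLESS I */
        if (turkic && c == 0x0130)
            return 'i';              /* I WITH DOT ABOVE -> plain i */
        if (c >= 'A' && c <= 'Z')
            return c + ('a' - 'A');  /* default ASCII rule */
        return c;                    /* everything else left alone in this sketch */
    }

    int main(void)
    {
        printf("en: I -> U+%04X\n", to_lower('I', "en_US"));  /* U+0069 */
        printf("tr: I -> U+%04X\n", to_lower('I', "tr_TR"));  /* U+0131 */
        return 0;
    }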

Titlecasing is needed in some languages like Czech and Russian, where
the default casing of some names is different when they are used in
running text and when they are used in lists in the GUI.

6) XPath: Fortunately, this is available in libxml2.
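
A small sketch of what using it would look like, pulling the exemplar
set of item 1 out of a CLDR locale file (the file name here is just an
example; the XPath is the LDML location of the exemplar characters):

    #include <stdio.h>
    #include <libxml/parser.h>
    #include <libxml/xpath.h>

    int main(void)
    {
        xmlDocPtr doc = xmlReadFile("sr_Latn.xml", NULL, 0);
        if (!doc)
            return 1;

        xmlXPathContextPtr ctxt = xmlXPathNewContext(doc);
        xmlXPathObjectPtr result = xmlXPathEvalExpression(
            (const xmlChar *) "/ldml/characters/exemplarCharacters", ctxt);

        if (result && result->nodesetval && result->nodesetval->nodeNr > 0) {
            xmlChar *value = xmlNodeGetContent(result->nodesetval->nodeTab[0]);
            printf("exemplar characters: %s\n", value);
            xmlFree(value);
        }

        xmlXPathFreeObject(result);
        xmlXPathFreeContext(ctxt);
        xmlFreeDoc(doc);
        return 0;
    }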

7) POSIX regular expression handling: This is needed for handling
"yesexpr" and "noexpr". We could probably copy the code directly from
glibc.
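
A sketch of what this amounts to, with a hard-coded pattern standing in
for the yesexpr value that would really come from the locale data:

    #include <regex.h>
    #include <stdio.h>

    /* Does 'answer' match the given yesexpr/noexpr pattern? */
    static int matches(const char *pattern, const char *answer)
    {
        regex_t re;
        int ok;

        if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
            return 0;
        ok = regexec(&re, answer, 0, NULL, 0) == 0;
        regfree(&re);
        return ok;
    }

    int main(void)
    {
        const char *yesexpr = "^[yY]";  /* made-up value, for illustration only */
        printf("\"yes\": %d\n", matches(yesexpr, "yes"));
        printf("\"no\":  %d\n", matches(yesexpr, "no"));
        return 0;
    }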

There are also several small algorithms specified in the text of the
specification (like one for fallback localization of timezone names),
which should be easy to implement once we decode the spec; that may
again require looking at the ICU code.

roozbeh




