Looking up fonts and shapers in Pango




In the first stage of Pango, itemization, there are essentially four
tasks involved:
 
 - Identification of bidirectional runs
 - Choosing a "language engine" (handles word/character/line break determination)
   for each character
 - Choosing a font for each character.
 - Choosing a shape engine for each character (handles conversion of
   characters into font-specific glyphs)

I'm running into some difficulty figuring out exactly how the last two
should work, so I thought I'd send my thoughts here and hopefully get
some useful feedback.

The difficulty with the last two is that they are interrelated, and
can potentially involve a lot of computation.


The inputs into the process are:

 - The character (unicode code point)
 - A description of the font attributes 
   (family name, slant, weight, etc.)
 - The applicable language tag for the segment of text

As well as some global parameters:

 - A database of applicable fonts
 - A database of available shaper engines
 - Control parameters that affect the font selection process
   (visual quality / exact match control parameter, etc.)

The outputs are:

 - The font
 - The shape engine


To give some idea of the interactions that can go on:

 - A shape engine can form pre-composed characters by composing
   individual glyphs (you can display á if you have an acute
   accent and an a). So, depending on the shape engines available,
   a character may or may not be covered by a font.

 - Because of the national variants for CJK characters, it's possible
   that one font should be used for a codepoint for language A and a
   different font for the same codepoint for language B. However,
   if the second font is not available, then the first font may
   be an acceptable fallback for language B.

 - Because different font systems (native X fonts and libart fonts,
   for instance) may be used simultaneously, the set of shape engines
   possible will depend on the font (or vice versa).
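
The first interaction above can be sketched in C. This is only an
illustration, not Pango code: the HasGlyphFunc callback, the
zero-terminated glyph list standing in for a font, and the two-entry
decomposition table are all assumptions made for the example. It shows
why coverage depends on the shaper: a composing shaper covers á when
the font has either the precomposed glyph or both of its parts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: does the font have a glyph for ch? */
typedef int (*HasGlyphFunc) (unsigned int ch, void *font_data);

typedef struct {
  unsigned int composed;  /* precomposed character */
  unsigned int base;      /* base letter */
  unsigned int mark;      /* combining mark */
} Decomposition;

/* Tiny illustrative table; a real shaper would use the full Unicode data. */
static const Decomposition decompositions[] = {
  { 0x00E1, 0x0061, 0x0301 },  /* a with acute = a + combining acute */
  { 0x00E9, 0x0065, 0x0301 },  /* e with acute = e + combining acute */
};

/* Toy font for the example: font_data is a zero-terminated code point list. */
int
glyph_list_has_glyph (unsigned int ch, void *font_data)
{
  const unsigned int *glyphs = font_data;

  for (; *glyphs != 0; glyphs++)
    if (*glyphs == ch)
      return 1;
  return 0;
}

/* Returns 1 if a composing shaper could display ch with this font:
 * either the glyph exists directly, or both parts of its
 * decomposition exist. */
int
composing_shaper_covers (unsigned int ch, HasGlyphFunc has_glyph,
                         void *font_data)
{
  size_t i;

  assert (has_glyph != NULL);

  if (has_glyph (ch, font_data))
    return 1;

  for (i = 0; i < sizeof decompositions / sizeof decompositions[0]; i++)
    if (decompositions[i].composed == ch)
      return has_glyph (decompositions[i].base, font_data) &&
             has_glyph (decompositions[i].mark, font_data);

  return 0;
}
```

A shaper that does no composition would report only the direct case,
so the same font would have a smaller coverage with it.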


I would consider it desirable to put an ordering on the process - to
either pick the font first, and then subsequently pick the shaper
based on that, or to pick the shaper first, and then select the font.
Picking the font first is more natural, since fonts are the part
that is visible to the user and to the programmer working
at a high level.

So, say you decide to pick the font first; then you need to be able
to figure out the coverage of a font for a particular language tag.
Once you have those coverages, then you can select an appropriate
font. And from there, the implementation of a

 PangoShaper *pango_font_get_shaper (PangoFont *font,
                                     PUChar ch, char *lang_tag);

method is straightforward.

Language tags can be dealt with by falling back from specific to
general: if the language tag is "en_US", first check coverages for
"en_US", then for "en", then for "".
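
A minimal sketch of that fallback order in C. The helper name
lang_tag_candidate and its interface are hypothetical, and it assumes
that '_' is the only separator in a tag. It produces the candidate tags
one at a time so that a caller can check the coverage for each in turn.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write the idx-th fallback candidate for lang_tag
 * into buf. Candidate 0 is the full tag, each later candidate drops the
 * last '_'-separated component, and the final candidate is "".
 * Returns 1 if a candidate was written, 0 if idx is past the last
 * candidate or the tag does not fit in buf. */
int
lang_tag_candidate (const char *lang_tag, int idx, char *buf, size_t buf_len)
{
  size_t len;

  assert (lang_tag != NULL && buf != NULL);
  assert (idx >= 0 && buf_len > 0);

  len = strlen (lang_tag);
  if (len >= buf_len)
    return 0;
  memcpy (buf, lang_tag, len + 1);

  while (idx > 0)
    {
      char *sep;

      if (buf[0] == '\0')
        return 0;               /* already at "", no further fallback */

      sep = strrchr (buf, '_');
      if (sep)
        *sep = '\0';            /* "en_US" -> "en" */
      else
        buf[0] = '\0';          /* "en" -> "" */
      idx--;
    }

  return 1;
}
```

A caller would loop with idx = 0, 1, 2, ... until the function returns
0 or a coverage for the candidate tag is found.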

The brute force approach for determining the coverages is to load up
each shaper that is appropriate for the font (i.e. each shaper
registered for the font technology) and query the coverage of that
shaper for the given language tag and font. Then take the union of all
the ranges.

The problems with this approach are:

 - Loading up each shaper is expensive to begin with and the query
   process more so, so you'd probably need to cache the information across
   applications, with the resulting difficulties with 
   up-to-dateness of the cache, locking, font installation, etc.

   One thing that could help with many of these problems would be using
   a font-lookup server. But this brings its own problems, since Pango
   is supposed to sit below the level where IPC mechanisms, configuration
   file formats, activation services, etc., are defined.

   A possible approach is to simply provide hooks for such a server,
   and then a sample implementation on top of X client messages,
   ICE, CORBA, or whatever.

 - The query process can be quite time consuming for something like the Basic
   shape engine which does composition. It would basically consist
   of iterating through the Unicode code space and checking if each
   character is covered.

 - Ranges for CJK fonts can be very fragmented. When a JIS, GB2312,
   etc. font is remapped into Unicode, the subset of the CJK character
   space covered by the font will not be a contiguous block by
   any means. The most compact representation of the coverage may
   well be an 8k bitmap. (Another good candidate is a two-level
   table of n/256 32-byte bitmaps.)
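
To make the two-level table concrete, here is a rough sketch in C. It
assumes a 16-bit code space (the same assumption behind the 8k figure:
65536 bits is 8192 bytes), and all of the names (Coverage,
coverage_set, and so on) are made up for the example. Each block of
256 code points gets a 32-byte bitmap that is allocated only when
something in the block is covered. The union step of the brute-force
approach is then a per-block OR.

```c
#include <assert.h>
#include <stdlib.h>

#define BLOCK_COUNT 256   /* 65536 code points / 256 per block */
#define BLOCK_BYTES 32    /* 256 bits per block */

typedef struct {
  unsigned char *blocks[BLOCK_COUNT];  /* NULL: nothing covered in block */
} Coverage;

Coverage *
coverage_new (void)
{
  Coverage *cov = malloc (sizeof *cov);
  unsigned int hi;

  if (cov)
    for (hi = 0; hi < BLOCK_COUNT; hi++)
      cov->blocks[hi] = NULL;
  return cov;
}

void
coverage_free (Coverage *cov)
{
  unsigned int hi;

  if (!cov)
    return;
  for (hi = 0; hi < BLOCK_COUNT; hi++)
    free (cov->blocks[hi]);
  free (cov);
}

/* Mark ch as covered. Returns 0 if ch is out of range or on
 * allocation failure. */
int
coverage_set (Coverage *cov, unsigned int ch)
{
  unsigned int hi = ch >> 8, lo = ch & 0xff;

  assert (cov != NULL);
  if (hi >= BLOCK_COUNT)
    return 0;
  if (!cov->blocks[hi])
    {
      cov->blocks[hi] = calloc (BLOCK_BYTES, 1);
      if (!cov->blocks[hi])
        return 0;
    }
  cov->blocks[hi][lo >> 3] |= (unsigned char) (1u << (lo & 7));
  return 1;
}

/* Returns 1 if ch is covered, 0 otherwise. */
int
coverage_get (const Coverage *cov, unsigned int ch)
{
  unsigned int hi = ch >> 8, lo = ch & 0xff;

  assert (cov != NULL);
  if (hi >= BLOCK_COUNT || !cov->blocks[hi])
    return 0;
  return (cov->blocks[hi][lo >> 3] >> (lo & 7)) & 1;
}

/* dest becomes the union of dest and src. Returns 0 on allocation
 * failure. */
int
coverage_union (Coverage *dest, const Coverage *src)
{
  unsigned int hi, i;

  assert (dest != NULL && src != NULL);
  for (hi = 0; hi < BLOCK_COUNT; hi++)
    {
      if (!src->blocks[hi])
        continue;
      if (!dest->blocks[hi])
        {
          dest->blocks[hi] = calloc (BLOCK_BYTES, 1);
          if (!dest->blocks[hi])
            return 0;
        }
      for (i = 0; i < BLOCK_BYTES; i++)
        dest->blocks[hi][i] |= src->blocks[hi][i];
    }
  return 1;
}
```

Compared with a flat bitmap, this costs one pointer per block up front
plus 32 bytes for each block that is actually populated, so it only
wins when most blocks are empty.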

Still, given caching, this approach is probably within the realm 
of feasibility. 


The other approach that I've thought of for handling the font-coverage
problem is to simply punt the procedure to the installation process.

That is, to assume that the person or vendor setting up system
fonts, and likewise anyone distributing third-party fonts, has good
knowledge of which shapers will be available. For instance, somebody
installing an X font encoded as tscii-0 assumes that a Tamil shaper is
available and then claims the coverage range for the font is
U+0B80 - U+0BFF, and installs some appropriate line in a global
configuration file.
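
As an illustration only, such a line might name the font encoding, the
shaper it relies on, and the claimed range in hex. The line format
below ("encoding shaper first-last") and the names CoverageClaim and
parse_coverage_claim are invented for this sketch; no actual file
format has been defined.

```c
#include <assert.h>
#include <stdio.h>

/* One hypothetical configuration line, e.g. "tscii-0 tamil 0B80-0BFF". */
typedef struct {
  char encoding[32];    /* font encoding name */
  char shaper[32];      /* shaper the installer assumes is available */
  unsigned int first;   /* first claimed code point */
  unsigned int last;    /* last claimed code point */
} CoverageClaim;

/* Parse one line of the assumed format. Returns 1 on success, 0 if the
 * line is malformed or the range is reversed. */
int
parse_coverage_claim (const char *line, CoverageClaim *claim)
{
  assert (line != NULL && claim != NULL);

  if (sscanf (line, "%31s %31s %x-%x",
              claim->encoding, claim->shaper,
              &claim->first, &claim->last) != 4)
    return 0;

  return claim->first <= claim->last;
}

/* Returns 1 if ch falls inside the claimed range. */
int
coverage_claim_covers (const CoverageClaim *claim, unsigned int ch)
{
  assert (claim != NULL);
  return ch >= claim->first && ch <= claim->last;
}
```

With this approach nothing is verified at lookup time: the claim is
trusted as written, which is exactly why it depends on the installer
knowing which shapers are present.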

The worry I have with the second approach is that it means that the
font installation process has to be very tightly coupled to Pango,
and while I hope that Pango will be very widely used, a high level
of installation complexity would hinder that.


Anyways, those are my current thoughts on the matter. I'd very much
like to hear if people have better ideas either for the overall method
of font/shaper selection, or for the coverage-determination
problem. In the absence of advice, I'll probably start on a prototype
of the brute-force approach, and see just how bad it is.

Regards,
                                        Owen


