Re: Font priorities



At 10:57 AM 2/5/2003, Owen Taylor wrote:
On Wed, 2003-02-05 at 13:50, Eric Mader wrote:
> At 09:13 PM 2/4/2003, Owen Taylor wrote:
> >One possible thing we'll be able to do in future versions of Pango, is
> >that I plan to do script detection, so I'll be able to tell that
> >a run of text is in Devanagari ... if you then had a table mapping
> >from language => script, you could tell that the 'en' language
> >tag _couldn't_ be appropriate for the Hindi text, though you would
> >have no idea what language tag _was_ appropriate.
>
> ICU has code which maps a code point to a script. There is also code to
> identify runs of text all in the same script, taking neutral characters
> into account. For example, Arabic words separated by spaces will be
> returned as a single run.
>
> (The ICU code which maps code points to script uses the general Unicode
> properties mechanism, which pulls in quite a bit of ICU; I have a version
> of this code which uses a table built by an ICU application so you can
> avoid the direct ICU dependencies...)

Yeah - I already have a port of this code to use in Pango :-)

See attachments to:

  http://bugzilla.gnome.org/show_bug.cgi?id=91542

What remains to be done is hooking it up to shaper selection.

You rock! A few months ago I was using this code to split a paragraph into runs of text that share the same script, direction and font, and I found that the script run code, as currently written, interacts strangely with the bidi code. Here's a short summary of what I found:

I was thinking about whether to compute the script runs over the whole paragraph or for each directional (and font?) run. Whichever way I do it, I can imagine a case where it will do the wrong thing. First assume that I find the script runs within each directional run. Given the input text "english (ARABIC) hindi.", the directional runs will of course be "english (", "ARABIC" and ") hindi." I'll assign the whole first run to the Latin script, the whole second run to the Arabic script, and the whole third run to the Devanagari script. If the text were all left-to-right, I'd assign the closing paren to the Latin script 'cause the opening paren got assigned to Latin. It seems like a mistake to let a change in direction change which script a character gets assigned to.
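To make the paren case concrete, here is a minimal Python sketch of a script run splitter in the same spirit as the script run code discussed above. It is not the ICU or Pango implementation: `toy_script`, `script_runs` and the name-prefix lookup are hypothetical stand-ins (Python's standard library has no script property), and the sample text uses real Arabic and Devanagari letters in place of the "ARABIC" and "hindi" placeholders.

```python
import unicodedata

KNOWN = {"LATIN", "ARABIC", "DEVANAGARI"}
PAIRS = {")": "(", "]": "[", "}": "{"}  # closing bracket -> opening bracket


def toy_script(ch):
    """Hypothetical stand-in for a real script lookup: uses the first word
    of the Unicode character name; everything else counts as Common."""
    word = unicodedata.name(ch, "").split(" ")[0]
    return word.capitalize() if word in KNOWN else "Common"


def script_runs(text):
    """Split text into (start, end, script) runs. Common characters join
    the surrounding run; a closing bracket takes the script that its
    matching opening bracket was assigned."""
    runs = []
    stack = []  # [opening bracket, script assigned to it]
    start, run_script = 0, "Common"
    for i, ch in enumerate(text):
        sc = toy_script(ch)
        if ch in PAIRS:
            # Discard unmatched openers, then reuse the matching one's script.
            while stack and stack[-1][0] != PAIRS[ch]:
                stack.pop()
            if stack:
                sc = stack.pop()[1]
        if sc != "Common" and run_script != "Common" and sc != run_script:
            runs.append((start, i, run_script))
            start, run_script = i, sc
        elif run_script == "Common" and sc != "Common":
            run_script = sc
            # Openers seen before the run had a script now resolve to it.
            for entry in stack:
                if entry[1] == "Common":
                    entry[1] = sc
        if ch in PAIRS.values():
            stack.append([ch, run_script])
    if text:
        runs.append((start, len(text), run_script))
    return runs


# "english (ARABIC) hindi." with real letters, written as escapes.
text = ("english (\u0639\u0631\u0628\u064a) "
        "\u0939\u093f\u0928\u094d\u0926\u0940.")

whole = script_runs(text)        # script runs over the whole paragraph
third = script_runs(text[13:])   # third directional run, ") hindi.", alone
print(whole)
print(third)
```

Over the whole paragraph the closing paren lands in a Latin run, because the stack remembers the opening paren. Run on the third directional run by itself, the stack is empty, so the paren and the space fall through to Devanagari.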

So, what happens if I compute the script runs up front, and then intersect them with the directional runs? The case above would work out like I want. But what about this simple case: "english ARABIC more english." The script runs would be "english ", "ARABIC " and "more english." The directional runs will be "english ", "ARABIC" and " more english." This means that the space after the word "ARABIC" will be assigned to the Arabic script, even though Bidi processing said it's a left-to-right space. This doesn't seem right either...
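The intersection step can be sketched as below. This is only an illustration: `intersect_runs` is a hypothetical helper, and the run boundaries are written out by hand from the example above rather than computed by a real bidi or script implementation.

```python
def intersect_runs(script_runs, dir_runs):
    """Intersect two sorted, gap-free lists of (start, end, label) runs
    covering the same text; returns (start, end, script, direction)."""
    out = []
    i = j = 0
    while i < len(script_runs) and j < len(dir_runs):
        s_start, s_end, script = script_runs[i]
        d_start, d_end, direction = dir_runs[j]
        start, end = max(s_start, d_start), min(s_end, d_end)
        if start < end:
            out.append((start, end, script, direction))
        # Advance whichever run ends first (both if they end together).
        if s_end <= d_end:
            i += 1
        if d_end <= s_end:
            j += 1
    return out


text = "english ARABIC more english."

# Script runs: "english ", "ARABIC ", "more english."
script = [(0, 8, "Latin"), (8, 15, "Arabic"), (15, 28, "Latin")]
# Directional runs: "english ", "ARABIC", " more english."
direction = [(0, 8, "LTR"), (8, 14, "RTL"), (14, 28, "LTR")]

for start, end, sc, d in intersect_runs(script, direction):
    print(repr(text[start:end]), sc, d)
```

The third piece of the result is the single space at offset 14, tagged Arabic but left-to-right, which is exactly the mismatch described above.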

What this seems to mean is that the simple way that the script run code assigns scripts to neutral characters isn't good enough. It seems like it needs to take more than just the raw script IDs into account... maybe it needs to do Bidi analysis too? Maybe I need a function that does Bidi and script runs at the same time?


Regards,
                                            Owen



