Re: Explicit LRO, RLO, etc. in FT2 patch



Tony Graham <Tony Graham ireland sun com> writes:

> Owen Taylor wrote at 23 Aug 2001 13:50:58 -0400:
>  > Tony Graham <Tony Graham ireland sun com> writes:
>  > > The following patch expands the list of characters that are
>  > > effectively passed over as glyphs to include all non-graphic
>  > > characters.
> ...
>  > > Index: modules/basic/basic-ft2.c
>  > > ===================================================================
>  > > RCS file: /sgnome/cvsroots/GNOME/pango/modules/basic/basic-ft2.c,v
>  > > retrieving revision 1.11
>  > > diff -r1.11 basic-ft2.c
>  > > 223c223
>  > > <       if (wc == 0x200B || wc == 0x200E || wc == 0x200F)	/* Zero-width characters */
>  > > ---
>  > > >       if (!g_unichar_isprint (wc))	/* Zero-width characters */
>  > 
>  > Hmmm:
>  > 
>  > gboolean
>  > g_unichar_isprint (gunichar c)
>  > {
>  >   int t = TYPE (c);
>  >   return (t != G_UNICODE_CONTROL
>  > 	  && t != G_UNICODE_FORMAT
>  > 	  && t != G_UNICODE_UNASSIGNED
>  > 	  && t != G_UNICODE_PRIVATE_USE
>  > 	  && t != G_UNICODE_SURROGATE);
>  > }
>  > 
>  > I think we should perhaps only exclude G_UNICODE_FORMAT here if
>  > excluding classes of Unicode characters is the right
>  > technique. Non-printed characters in the output stream are more
>  > confusing than boxes or "unknown character" glyphs when the user
>  > starts editing.
> 
> I did expect that this would lead to some discussion of character
> classes.
> 
> Excluding controls doesn't fit with "single paragraph mode".
> 
> Excluding unassigned characters is safe, but doesn't future-proof
> Pango, since gunichartables.h is based on Unicode 3.0.1 in GLib 1.3.6,
> on Unicode 3.1 in the unreleased GLib 1.3.7., and gunichartables.h
> will change again when the next version of Unicode is released
> (probably early next year).
> 
> Excluding private use characters prohibits private use graphic
> characters, so I never was comfortable with doing that.
> 
> Whether to exclude surrogates depends on where Pango (and GLib in
> general) comes down on the UTF-8/UTF-8s question that's roiling
> Unicode at present.  Certainly an unpaired surrogate is an error, and
> under the current Unicode 3.1 definition, characters outside the BMP
> should be written in UTF-8 as a four-byte sequence, not as two
> three-byte surrogate code points, so I'd favour also excluding
> surrogate code points.
> 
> So, yes, excluding only format characters is preferable, but also
> excluding surrogates may be reasonable.

Surrogate characters never get here ... pango_layout_set_text() now
calls g_utf8_validate() which rejects strings containing surrogate
characters encoded as UTF-8.
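
For anyone following along, here is a minimal sketch of the kind of
check involved. It is not the g_utf8_validate() implementation, and
sketch_has_encoded_surrogate() is a hypothetical name used only for
illustration. It relies on the fact that U+D800..U+DFFF, if written
as three-byte UTF-8, always start with the byte 0xED followed by a
byte in the range 0xA0..0xBF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of GLib or Pango.
 * Returns true if the buffer contains a surrogate code point
 * (U+D800..U+DFFF) encoded as a three-byte UTF-8 sequence.
 * 0xED can only be a lead byte, so the pair test below is enough
 * to spot the encoding; it does not validate anything else. */
bool sketch_has_encoded_surrogate(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        if (s[i] == 0xED && s[i + 1] >= 0xA0 && s[i + 1] <= 0xBF)
            return true;
    }
    return false;
}
```

A real validator also has to check overlong forms, truncated
sequences and the upper code point limit; this only shows the
surrogate case.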

Basically, the decision about which characters to exclude here is not
a decision about whether they are valid, but about whether to display
them as an "unknown character" box or simply to hide them. Hiding
characters only makes sense when they are something like a
bidirectional control character ... something that was meaningful in
the input stream but has no visual representation.

Something like an unassigned character or a private-use-area character
is best presented to the user as "I don't know what this is".
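
Roughly, the policy I have in mind could be sketched like this. The
enum and function names are made up for the example and only mirror
the GLib category names loosely; this is not what basic-ft2.c does
today, and whether the font has a glyph is assumed to be known by
the caller.

```c
#include <stdbool.h>

/* Hypothetical, simplified categories (loosely after G_UNICODE_*). */
typedef enum {
    SKETCH_CONTROL,
    SKETCH_FORMAT,
    SKETCH_UNASSIGNED,
    SKETCH_PRIVATE_USE,
    SKETCH_OTHER
} SketchCategory;

typedef enum {
    SKETCH_DRAW_GLYPH,   /* font has a glyph: draw it          */
    SKETCH_HIDE,         /* meaningful in input, nothing drawn */
    SKETCH_UNKNOWN_BOX   /* "I don't know what this is"        */
} SketchRender;

/* Only format characters are hidden. Everything else is drawn if
 * the font can draw it, and shown as an unknown-character box if
 * it cannot, so stray junk stays visible to the user. */
SketchRender sketch_render_policy(SketchCategory cat, bool font_has_glyph)
{
    if (cat == SKETCH_FORMAT)
        return SKETCH_HIDE;
    return font_has_glyph ? SKETCH_DRAW_GLYPH : SKETCH_UNKNOWN_BOX;
}
```

Note that a private-use character with a glyph in the font still
gets drawn under this sketch; only the no-glyph case falls back to
the box.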

A good example of what happens when you just ignore junk characters
is shown by the switch in recent XFree86 to use boxes instead of
empty zero-width characters for control characters in its fonts.
Suddenly, all sorts of places where random control characters ended
up in the stream are showing up.

Regards,
                                        Owen



