Re: Explicit LRO, RLO, etc. in FT2 patch



Owen Taylor wrote at 23 Aug 2001 13:50:58 -0400:
 > Tony Graham <Tony Graham ireland sun com> writes:
 > > The following patch expands the list of characters that are
 > > effectively passed over as glyphs to include all non-graphic
 > > characters.
...
 > > Index: modules/basic/basic-ft2.c
 > > ===================================================================
 > > RCS file: /sgnome/cvsroots/GNOME/pango/modules/basic/basic-ft2.c,v
 > > retrieving revision 1.11
 > > diff -r1.11 basic-ft2.c
 > > 223c223
 > > <       if (wc == 0x200B || wc == 0x200E || wc == 0x200F)	/* Zero-width characters */
 > > ---
 > > >       if (!g_unichar_isprint (wc))	/* Zero-width characters */
 > 
 > Hmmm:
 > 
 > gboolean
 > g_unichar_isprint (gunichar c)
 > {
 >   int t = TYPE (c);
 >   return (t != G_UNICODE_CONTROL
 > 	  && t != G_UNICODE_FORMAT
 > 	  && t != G_UNICODE_UNASSIGNED
 > 	  && t != G_UNICODE_PRIVATE_USE
 > 	  && t != G_UNICODE_SURROGATE);
 > }
 > 
 > I think we should perhaps only exclude G_UNICODE_FORMAT here if
 > excluding classes of Unicode characters is the right
 > technique. Non-printed characters in the output stream are more
 > confusing than boxes or "unknown character" glyphs when the user
 > starts editing.

I did expect that this would lead to some discussion of character
classes.

Excluding controls doesn't fit with "single paragraph mode".

Excluding unassigned characters is safe, but doesn't future-proof
Pango, since gunichartables.h is based on Unicode 3.0.1 in GLib 1.3.6
and on Unicode 3.1 in the unreleased GLib 1.3.7, and it will change
again when the next version of Unicode is released (probably early
next year).

Excluding private use characters prohibits private use graphic
characters, so I never was comfortable with doing that.

Whether to exclude surrogates depends on where Pango (and GLib in
general) comes down on the UTF-8/UTF-8s question that's roiling
Unicode at present.  Certainly an unpaired surrogate is an error, and
under the current Unicode 3.1 definition, characters outside the BMP
should be written in UTF-8 as a four-byte sequence, not as two
three-byte surrogate code points, so I'd favour also excluding
surrogate code points.
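
To make the four-byte versus two-times-three-byte distinction concrete,
here is a minimal sketch in plain C. The function names utf8_encode and
utf16_surrogates are hypothetical, made up for this illustration; they
are not GLib or Pango API. It encodes a code point outside the BMP as a
single four-byte UTF-8 sequence and refuses to encode a surrogate code
point on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of GLib or Pango.
 * Encode one code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate code point or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not encodable */
    if (cp > 0x10FFFF)
        return 0;                       /* outside the Unicode range */

    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* Outside the BMP: one four-byte sequence. */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical helper: split a code point in U+10000..U+10FFFF into
 * its UTF-16 surrogate pair, only to show which two code points the
 * "two three-byte sequences" form would be built from. */
void utf16_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    cp -= 0x10000;
    *hi = (uint16_t)(0xD800 | (cp >> 10));
    *lo = (uint16_t)(0xDC00 | (cp & 0x3FF));
}
```

For U+10400 this gives the four bytes F0 90 90 80, while the surrogate
pair for the same character is D801 DC00; passing either of those two
values to utf8_encode returns 0.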

So, yes, excluding only format characters is preferable, but also
excluding surrogates may be reasonable.
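
As a rough sketch of that narrower test, in plain C so that it stands
alone: the function names below are hypothetical, and the hard-coded
list is only the three characters from the original line 223 plus the
explicit directional formatting characters U+202A..U+202E (LRE, RLE,
PDF, LRO, RLO). It is an assumption for illustration, not the full
format class; in basic-ft2.c the check would more likely compare
g_unichar_type (wc) against G_UNICODE_FORMAT.

```c
#include <stdbool.h>
#include <stdint.h>

/* Surrogate code points, U+D800..U+DFFF. */
static bool is_surrogate(uint32_t wc)
{
    return wc >= 0xD800 && wc <= 0xDFFF;
}

/* Hypothetical stand-in for a format-class lookup: the three
 * characters the old code tested for, plus LRE, RLE, PDF, LRO, RLO.
 * This list is an assumption and is deliberately incomplete. */
static bool is_known_format_char(uint32_t wc)
{
    return wc == 0x200B || wc == 0x200E || wc == 0x200F
        || (wc >= 0x202A && wc <= 0x202E);
}

/* Hypothetical predicate: true if wc should be passed over rather
 * than drawn as a glyph. Controls, unassigned and private use
 * characters are NOT skipped, so they still show up as boxes or
 * "unknown character" glyphs. */
bool skip_as_glyph(uint32_t wc)
{
    return is_known_format_char(wc) || is_surrogate(wc);
}
```

With this, LRO (U+202D) and RLO (U+202E) are skipped, while a private
use character such as U+E000 or a control such as U+000A is still
handed on to be rendered.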

 > (Note that editing of directional formatting characters is something
 > we still haven't quite figured out how it should work.)

But that shouldn't mean that text containing the formatting characters 
can't be presented without error.

Regards,


Tony Graham
------------------------------------------------------------------------
Tony Graham                           mailto:tony graham ireland sun com
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708



