Re: Editing and formatting characters

From: Owen Taylor <otaylor redhat com>
To: gtk-i18n-list gnome org
Subject: Re: Editing and formatting characters
Date: 15 Nov 2000 14:31:35 -0500

Pablo Saratxaga <pablo mandrakesoft com> writes:

> On Tue, Nov 14, 2000 at 05:35:45PM -0500, Owen Taylor wrote:
>  
> > We recently ran into some bugs with GTK+ and Pango with the handling
> > of formatting characters. And it occurred to me that I never had
> ...
> 
> >  ZWNBS - zero width non-break space
> (just out of curiosity, what is the purpose of that one?)

It basically inhibits a line break at the position. (Section 13.2 in
the Unicode Standard v3.0.) The example they give is if you want
to display "base+delta" and inhibit line breaks around the + character,
you can use the text "base[ZWNBS]+[ZWNBS]delta".
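
As a rough illustration, here is that string built in Python rather than
in anything from GTK+ or Pango. U+FEFF is the code point for ZERO WIDTH
NO-BREAK SPACE; the variable names are made up for this sketch.

```python
# U+FEFF is ZERO WIDTH NO-BREAK SPACE (ZWNBS in this thread).
ZWNBS = "\ufeff"

# "base[ZWNBS]+[ZWNBS]delta": no line break is wanted on either side of '+'.
text = "base" + ZWNBS + "+" + ZWNBS + "delta"

# The marks take up code points in the buffer but have no visible width,
# so the stored length and the displayed length differ.
visible = text.replace(ZWNBS, "")
print(len(text), len(visible))
```

This mismatch between stored characters and visible characters is what
makes cursor positioning and deletion awkward for the cases below.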

> > In the following, I suggest one way of handling formatting characters
> > during cursor positioning and deletion. Implementing this in
> 
> > As an example, let's use the example that has been coming up on this
> > list over the last few weeks:
> > 
> >    HEBREW TEXT [HYPHEN] [RLM] 1234
> >                        ^     ^
> >                        A     B
> > 
> > There should not be cursor positions at A and B because that would
> 
> I don't agree.
> 
> If we want to be able to manipulate those chars then they *must* be visible,
> and both A and B cursor positions *must* be possible.
> 
> However, I agree that it is not the normal behaviour.
> 
> So we need that both display and editing (cursor and thing) have two modes:
> 1. normal
> 2. "show control and formatting characters"
> 
> And all text editors should be able to provide a way to tell pango to
> switch between those two modes.

I'd agree that visible display is necessary if someone is working
extensively with the embedded formatting characters. It's quite
a bit of work, though, and I don't think I'll be able to do it for
Pango-1.0 / GTK+-2.0.
 
> Pango would also need to have a set of glyphs to display those special chars
> (and the choice of glyphs should be thought a bit, as it must be done in
> a way to make them easily recognizable by users to understand what they do.
> Maybe widespread fonts already provide such glyphs or unicode recommends
> the use of a given glyph in some cases?)

Well, one thing that Pango can do is to use multiple glyphs for the
character and treat them as a single character for cursor position,
selection, line-breaking, etc.

So, lacking suitable fonts, Pango could simply display '[RLM]' or
something like that. There are no real standard glyphs for these
characters, though the Unicode reference uses forms like:

 +--+
 |ZW|
 |NJ|
 +--+
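
A minimal sketch of the '[RLM]' fallback idea, in Python and not using any
Pango API: each formatting character becomes a bracketed label that is
still tied to a single source index, so cursor movement and selection can
keep treating it as one character. The LABELS table and the function name
show_formatting are hypothetical.

```python
# Hypothetical label table; the abbreviations are the ones used in this thread.
LABELS = {
    "\u200c": "ZWNJ",
    "\u200d": "ZWJ",
    "\u200e": "LRM",
    "\u200f": "RLM",
    "\u202a": "LRE",
    "\u202b": "RLE",
    "\u202c": "PDF",
    "\ufeff": "ZWNBS",
}

def show_formatting(text):
    """Return (display_string, source_index) cells, one cell per character."""
    cells = []
    for index, ch in enumerate(text):
        if ch in LABELS:
            # Several glyphs on screen, but still one character in the buffer.
            cells.append(("[" + LABELS[ch] + "]", index))
        else:
            cells.append((ch, index))
    return cells

print(show_formatting("a\u200fb"))
```

Keeping the source index next to each display string is one way to let a
multi-glyph label behave as a single unit for editing.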

> > result in having to hit the arrow key twice at that position, which
> > would be very non-intuitive. (Not to say that invisible formatting
> > marks are intuitive to begin with...)
> 
> So, in normal mode, only one arrow key would be necessary, and the formatting
> mark would only be deleted when the "associated char" would be (there must
> be a way to decide which one it is; the concept of "associated char" would
> be somewhat similar of the one used in Thai editing I think).

Well, we can always "make a decision", but it would sometimes be
completely arbitrary. The clearest example of this is a ZWNJ between
two Arabic characters.
 
> And in "show special chars" mode, the RLM mark will be shown just as another
> char, so copyable, deletable, etc.
> 
> > I think the correct cursor position is at B - that is, the mark should
> > be associated with the logically-preceding character.
> 
> not always.
> consider this example: <space> <ZWJ> <arabic letter>
> here the ZWJ should be associated with the arabic letter, not with the
> space; as it is evident that the purpose of the use of ZWJ in such case
> is to modify the behaviour of the next char, not the previous one.
> 
> In your case I agree that RLM should be associated with the hyphen, but
> not for the reason you give.

I see the point here, but would worry that the rules we would come up
with would be exceedingly complex and the user would see the behavior
as being essentially random. A consistent behavior may in some cases
be better than an intuitive behavior.
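
To make the "consistent" option concrete, here is a sketch in Python of
the rule that a mark always belongs to the logically-preceding character.
It is only an illustration of that one rule, not Pango code; FORMAT_MARKS
and cursor_positions are names invented for the example.

```python
# Single (unpaired) formatting marks discussed in this thread.
FORMAT_MARKS = {"\u200c", "\u200d", "\u200e", "\u200f", "\ufeff"}

def cursor_positions(text):
    """Offsets where the cursor may rest if marks stick to the preceding char."""
    positions = []
    for offset in range(len(text) + 1):
        before_mark = offset < len(text) and text[offset] in FORMAT_MARKS
        # An offset directly before a mark would separate the mark from the
        # character it belongs to, so it is skipped (position A in the example).
        if before_mark and offset > 0:
            continue
        positions.append(offset)
    return positions

# Two Hebrew letters, a hyphen, RLM, then digits.
sample = "\u05d0\u05d1-\u200f12"
print(cursor_positions(sample))
```

For the sample, offset 3 (between the hyphen and the RLM, position A) is
dropped and offset 4 (position B) is kept.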
 
> I think the algorithm should be, roughly, to associate the formatting char with
> the nearest char whose behaviour it modifies; if both previous and next char
> are modified the same way, associate it with the next one; if it is a change
> of direction, associate it with the neutral char; if none, with the one
> of the opposite directionality; if none, with the next char.
> 
> In your example the global direction was RTL, then the hyphen changed it to
> neutral (I think, I would need confirmation here), then the RLM changes
> it again to RTL; then there are the digits, which are LTR.

Actually, the Unicode directional algorithm isn't really as simple as
an idea of a "global direction" that changes as you go along the text.
A gross simplification would be to say that it starts with the
characters that have a definite direction and then assigns directions
to neutral characters based on the characters with definite
directionality around them.
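
The starting point of that process, the per-character directional class,
can be inspected with Python's standard unicodedata module. This sketch
only lists the classes; it does not implement the resolution of neutral
characters, and bidi_classes is a name chosen for the example.

```python
import unicodedata

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]

# A Hebrew letter, RLM, LRM and a European digit.
for ch in "\u05d0\u200f\u200e1":
    print("U+%04X" % ord(ch), unicodedata.bidirectional(ch))
```

The Hebrew letter and RLM are reported as R and LRM as L, which are
strong types. The digit is reported as EN, a weak type, so its final
direction depends on context rather than being simply left-to-right.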

> So, both the hyphen and the first digit are modified --> attach to
> the previous char.
> But now consider the same *without* the hyphen:
> 
> HEBREW TEXT <RLM> 1 2 3 4
> 
> Here, RLM doesn't change the behaviour of the last Hebrew letter, so it
> must be associated with "1" IMHO.

Well, in this sequence the RLM has no effect on the output at all,
so it is hard to say what it should be associated with.
 
> I think the formatting chars must be associated with the next char because
> they influence the behaviour of what is after them; however, some of them
> have influence on both sides (ZWJ, ZWNJ, ZWNBS,...), they should be associated
> with the side they really modify, if they modify both sides, then associate
> it with the next char (but I'm not 100% sure about that).
> some other chars modify the global direction (LRM, RLM), they must be
> associated to the neutral char as it is the one that is modified;
> if both sides are neutral, associate to the next one. If none is neutral,
> associate it with the char that has opposite directionality, if both
> are opposite, associate with the next one; if both sides have same
> directionality, associate with the next char.
> 
> Does that make sense?
> 
> > The question is what should happen when you hit Delete when the cursor
> > position is at B.
> 
> You mean, the BackSpace key I presume ?

Yes, sorry about confusing things. I mean Backspace.

> The RLM being associated with the hyphen, not with the numbers, deleting
> the numbers should have no effect on the RLM mark.
> 
> > The two possibilities are that only the RLM is
> > deleted, or both the RLM and the HYPHEN are deleted. I believe that
> > the better behavior is to delete both the HYPHEN and the RLM.
> 
> That depends.
> First, that is valid only in "normal" mode; in "display control codes"
> it is shown as a char of its own and the user can do as usual.
> So, in "normal" mode, the behaviour should be, imho:
> For modifiers that only act on adjacent char, yes, delete both the
> associated char and the formatter.
> For LRM and RLM marks, however, they also have effects on other chars
> than just the adjacent one; so the formatter char should be kept as
> long as its effect remains; eg:
> 
> HEBREW TEXT [HYPHEN1] [HYPHEN2] [HYPHEN3] [RLM] x 1234
> ('x' being the cursor position).
> 
> hitting backspace should delete the hyphen3 but *not* the RLM; which will
> then be associated to the hyphen2 and so on; only at deleting hyphen1
> will the RLM be deleted too.
> That is, if the associated char is deleted, the two new neighbors should
> be evaluated to see if they stand at a same or higher priority
> (same directionality>opposite directionality>neutral), if yes, keep them
> (and attach to the correct new neighbor) if not, delete it too. 

Something like this might be appropriate, but it is not going to be
at all simple to describe the algorithm, so again I worry that users
won't be able to figure out what is going on.
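
For comparison, the simpler rule from the original proposal (Backspace at
B deletes both the HYPHEN and the RLM) fits in a few lines. This is a
Python sketch under the assumption that marks always attach to the
preceding character; it is not the priority-based scheme quoted above,
and the names FORMAT_MARKS and backspace are invented for the example.

```python
# Single (unpaired) formatting marks discussed in this thread.
FORMAT_MARKS = {"\u200c", "\u200d", "\u200e", "\u200f", "\ufeff"}

def backspace(text, cursor):
    """Delete the char before the cursor together with the marks attached to it.

    Returns the new text and the new cursor offset.
    """
    start = cursor
    # Step back over any marks that trail the character being deleted.
    while start > 0 and text[start - 1] in FORMAT_MARKS:
        start -= 1
    # Then take the character that owns those marks, if there is one.
    if start > 0:
        start -= 1
    return text[:start] + text[cursor:], start

# Hebrew letter, hyphen, RLM, digits; the cursor sits at B, after the RLM.
print(backspace("\u05d0-\u200f12", 3))
```

One Backspace removes the hyphen and the RLM as a unit, which is easy to
describe to a user even where it is not the ideal result.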

> > The difference with the range-affecting characters is that the
> > paired nature must be maintained. For this example, we'll use an alternate
> > encoding of the above as:
> > 
> >    [RLE] HEBREW TEXT [HYPHEN] [PDF] 1234
> >                                    ^
> >                                    B
> > 
> > Again the cursor position should be at B, but when we delete at
> > B, we don't want to simply delete the PDF, we instead want to delete
> > the HYPHEN and leave the [PDF]. Then when we get to:
> 
> Mmmh, I have a doubt here... do the RLM and LRM marks change the global
> direction of all following chars or not?
> In other words, are RLM and LRM "range formatters without explicit ending"?

I don't follow exactly - RLM and LRM have no effect on ranges, except
implicitly via the Unicode bidirectional algorithm.
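
For the paired [RLE] ... [PDF] case quoted above, the behavior described
there (delete the HYPHEN, leave the PDF) might look like the following
Python sketch. The quoted text is cut off before it says what happens
once the range is empty, so this sketch simply leaves an empty pair
alone; PAIRED and backspace_keep_pairs are hypothetical names.

```python
# LRE, RLE, PDF, LRO, RLO: the explicit embedding and override controls.
PAIRED = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e"}

def backspace_keep_pairs(text, cursor):
    """Delete the nearest preceding ordinary char, keeping paired controls.

    Returns the new text and the new cursor offset.
    """
    i = cursor
    # Skip back over embedding controls so the RLE/PDF pair stays balanced.
    while i > 0 and text[i - 1] in PAIRED:
        i -= 1
    if i == 0:
        # Only controls (or nothing) before the cursor: leave the text as is.
        return text, cursor
    return text[:i - 1] + text[i:], cursor - 1

# RLE, Hebrew letter, hyphen, PDF, digits; the cursor sits at B, after the PDF.
print(backspace_keep_pairs("\u202b\u05d0-\u202c12", 4))
```

After the call the hyphen is gone, the RLE and PDF are both still
present, and the cursor remains just after the PDF.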

Thanks for the comments. I'll try later to come up with some sort of 
"rules for figuring the associated character" and see how well
that works out.

Regards,
                                        Owen