Editing and formatting characters



We recently ran into some bugs in GTK+ and Pango in the handling
of formatting characters, and it occurred to me that I had never
really thought through how formatting characters embedded in Unicode
text should behave during editing.

By a formatting character, I mean a character that generally has
no on-screen representation, either as a glyph or as a space. The
important formatting characters in Unicode basically fall into two
classes:

Individual characters:
 
 RLM   - right to left mark
 LRM   - left to right mark
 ZWJ   - zero width joiner
 ZWNJ  - zero width non-joiner
 ZWNBS - zero width no-break space

And characters that work on ranges:

 LRE    - left to right embedding
 RLE    - right to left embedding
 LRO    - left to right override
 RLO    - right to left override
 PDF    - pop directional formatting

There are also a number of other range characters that are strongly
deprecated, so I'll ignore them.
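
For reference, here is a small sketch that maps the abbreviations above to
their Unicode code points and sorts a character into one of the two classes.
The names INDIVIDUAL_MARKS, RANGE_MARKS and classify are made up for this
illustration; they are not GTK+ or Pango API.

```python
# Code points for the formatting characters listed above.
# The dictionary and function names are hypothetical, for illustration only.
INDIVIDUAL_MARKS = {
    "RLM": "\u200f",    # RIGHT-TO-LEFT MARK
    "LRM": "\u200e",    # LEFT-TO-RIGHT MARK
    "ZWJ": "\u200d",    # ZERO WIDTH JOINER
    "ZWNJ": "\u200c",   # ZERO WIDTH NON-JOINER
    "ZWNBS": "\ufeff",  # ZERO WIDTH NO-BREAK SPACE
}

RANGE_MARKS = {
    "LRE": "\u202a",    # LEFT-TO-RIGHT EMBEDDING
    "RLE": "\u202b",    # RIGHT-TO-LEFT EMBEDDING
    "PDF": "\u202c",    # POP DIRECTIONAL FORMATTING
    "LRO": "\u202d",    # LEFT-TO-RIGHT OVERRIDE
    "RLO": "\u202e",    # RIGHT-TO-LEFT OVERRIDE
}


def classify(ch):
    """Return 'individual', 'range', or 'other' for a single character."""
    if ch in INDIVIDUAL_MARKS.values():
        return "individual"
    if ch in RANGE_MARKS.values():
        return "range"
    return "other"
```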


In the following, I suggest one way of handling formatting characters
during cursor positioning and deletion. Implementing this in
GTK+ and Pango is going to be a bit tricky, but doable, so what
I'm interested in is whether people think the behavior is right.


Let's consider the individual characters first. Since they have no
graphical representation in the output stream, there should not be a
separate cursor position corresponding to them.

As an example, let's use the one that has been coming up on this
list over the last few weeks:

   HEBREW TEXT [HYPHEN] [RLM] 1234
                       ^     ^
                       A     B

There should not be cursor positions at both A and B, because that
would mean having to hit the arrow key twice to get past that position,
which would be very unintuitive. (Not to say that invisible formatting
marks are intuitive to begin with...)

I think the correct cursor position is at B - that is, the mark should
be associated with the logically-preceding character.
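
A minimal sketch of that rule, assuming cursor positions are logical
character offsets into the string: an offset is a valid cursor position
unless the character right after it is one of the individual marks, so
the only stop near a mark is the one following it. The names MARKS and
cursor_positions are hypothetical, and a mark at the very start of the
text is not treated specially here.

```python
# Individual formatting marks: LRM, RLM, ZWNJ, ZWJ, ZWNBS.
MARKS = {"\u200e", "\u200f", "\u200c", "\u200d", "\ufeff"}


def cursor_positions(text):
    """Return the logical offsets at which the cursor may rest.

    Offset i means "between text[i-1] and text[i]". An offset directly
    in front of a mark is skipped, so the mark stays attached to the
    logically preceding character.
    """
    positions = []
    for i in range(len(text) + 1):
        if 0 < i < len(text) and text[i] in MARKS:
            continue  # this is position A in the example: no stop here
        positions.append(i)
    return positions


# HEBREW TEXT [HYPHEN] [RLM] 1234, with two Hebrew letters as the text.
example = "\u05d0\u05d1-\u200f1234"
stops = cursor_positions(example)  # offset 3 (A) is absent, offset 4 (B) is present
```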

The question is what should happen when you hit Delete with the cursor
at B. The two possibilities are that only the RLM is
deleted, or both the RLM and the HYPHEN are deleted. I believe that
the better behavior is to delete both the HYPHEN and the RLM. That
is, the marks should be as invisible to editing as possible. 
If greater control is needed, then a special "view invisible characters"
mode should be available.
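
The proposed behavior could be sketched roughly as follows. I am assuming
here that Delete means deleting backward from the cursor, and that the
cursor is given as a logical offset; MARKS and delete_backward are
made-up names for this illustration, not existing GTK+ or Pango functions.

```python
# Individual formatting marks: LRM, RLM, ZWNJ, ZWJ, ZWNBS.
MARKS = {"\u200e", "\u200f", "\u200c", "\u200d", "\ufeff"}


def delete_backward(text, pos):
    """Delete the visible character before pos along with any marks
    attached to it. Returns (new_text, new_pos)."""
    start = pos
    # First swallow the marks sitting between the cursor and the
    # preceding visible character.
    while start > 0 and text[start - 1] in MARKS:
        start -= 1
    # Then swallow that visible character itself, if there is one.
    if start > 0:
        start -= 1
    return text[:start] + text[pos:], start


# Cursor at B in: HEBREW TEXT [HYPHEN] [RLM] 1234
example = "\u05d0\u05d1-\u200f1234"
new_text, new_pos = delete_backward(example, 4)
# Both the HYPHEN and the RLM are gone; the cursor sits after the Hebrew text.
```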


The difference with the range-affecting characters is that their
paired nature must be maintained. For this example, we'll use an
alternate encoding of the above:

   [RLE] HEBREW TEXT [HYPHEN] [PDF] 1234
                                   ^
                                   B

Again the cursor position should be at B, but when we delete at
B, we don't want to simply delete the PDF; we instead want to delete
the HYPHEN and leave the [PDF]. Then when we get to:
 
   [RLE] H [PDF] 1234
                ^
                B

Hitting delete would then delete PDF, H and RLE.
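
Here is a rough sketch of the paired case, again assuming backward
deletion and logical offsets. It steps back over any PDF before the
cursor, deletes the visible character in front of it, and then removes an
opener/PDF pair only once nothing is left between them. OPENERS, PDF and
delete_backward_paired are hypothetical names, and the sketch covers only
the two situations shown above (it does nothing when the cursor directly
follows an opener).

```python
OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
PDF = "\u202c"


def delete_backward_paired(text, pos):
    """Delete the visible character before pos, keeping embedding
    pairs intact until they are empty. Returns (new_text, new_pos)."""
    i = pos
    # Step back over PDFs just before the cursor; they are kept for now.
    while i > 0 and text[i - 1] == PDF:
        i -= 1
    if i == 0 or text[i - 1] in OPENERS:
        return text, pos  # case not covered by this sketch

    victim = i - 1
    chars = list(text)
    del chars[victim]
    new_pos = pos - 1

    # If an opener and its PDF are now adjacent, the pair is empty:
    # remove both, and repeat for enclosing pairs that became empty too.
    while (0 < victim < len(chars)
           and chars[victim - 1] in OPENERS
           and chars[victim] == PDF):
        removed_before_cursor = sum(
            1 for k in (victim - 1, victim) if k < new_pos)
        del chars[victim - 1:victim + 1]
        new_pos -= removed_before_cursor
        victim -= 1

    return "".join(chars), new_pos


# [RLE] HEBREW TEXT [HYPHEN] [PDF] 1234, cursor at B (offset 5):
step1 = delete_backward_paired("\u202b\u05d0\u05d1-\u202c1234", 5)
# [RLE] H [PDF] 1234, cursor at B (offset 3):
step2 = delete_backward_paired("\u202b\u05d0\u202c1234", 3)
```

In the first call only the HYPHEN goes away and the pair survives; in the
second the last letter, the PDF and the RLE all disappear together.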


Does this sound like the right behavior?

                                        Owen




