Re: Why does gtk_text_buffer() normalize the cluster on backspace?



Following up on my previous post: I looked into the definition of the Unicode canonical ordering used by g_utf8_normalize() and found the following in the Unicode normalization FAQ at http://unicode.org/faq/normalization.html#8:

<quote>
Q: Isn't the canonical ordering for Arabic characters wrong?

A: The Unicode Standard does not guarantee that the canonical ordering of a combining character sequence for any particular script is the 'correct' order from a linguistic point of view; the guarantee is that any two canonically equivalent strings will have the same canonical order.

In retrospect, it would have been possible to have assigned combining classes for certain Arabic and Hebrew non-spacing marks (plus characters for a few other scripts) that would have done a better job of making a canonically ordered sequence reflect linguistic order or traditional spelling orders for such sequences. However, retinkerings at this point would conflict with stability guarantees made by the Unicode Standard when normalization was specified, and cannot be done now. [KW]

</quote>

Basically it says that the ordering of the accent characters for Arabic and Hebrew may not be relied upon for any linguistic interpretation. I consider the choice of which character to erase when pressing backspace to be such an interpretation.

As I see it there are two ways of fixing this:

1. Add a calc_char_to_erase() routine to the Pango language modules that receives a cluster and determines which character should be erased. If the language module does not define such a routine, then either canonical ordering or no reordering is used.

2. Drop the canonical ordering altogether.
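
To make option 1 a bit more concrete, here is a rough sketch in plain C (no GLib or Pango needed to compile it). Everything in it is hypothetical: the LangModule struct, the lookup table, and the Hebrew rule are made up for illustration and are not existing Pango API. The Hebrew rule assumes that the dot U+05BC ties more strongly than the other marks, so it is erased only when no other mark is left. For languages without a hook, the sketch falls back to "no reordering", i.e. it erases the last code point as stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hook: given a cluster as an array of code points, return
 * the index of the code point that backspace should erase. */
typedef size_t (*CalcCharToEraseFunc) (const uint32_t *cluster, size_t len);

/* Hypothetical stand-in for a language module; not the real Pango type. */
typedef struct {
  const char          *lang;
  CalcCharToEraseFunc  calc_char_to_erase;  /* may be NULL */
} LangModule;

/* Illustrative Hebrew rule (an assumption, not established practice):
 * erase the last mark that is not U+05BC; if only the base letter and
 * U+05BC remain, erase the last code point. */
static size_t
hebrew_calc_char_to_erase (const uint32_t *cluster, size_t len)
{
  assert (len > 0);
  for (size_t i = len; i-- > 1; )   /* index 0 is the base letter */
    if (cluster[i] != 0x05BC)
      return i;
  return len - 1;
}

static const LangModule modules[] = {
  { "he", hebrew_calc_char_to_erase },
  { "en", NULL },                   /* module without a hook */
};

/* Ask the language module first; otherwise erase the last code point
 * as stored (the "no reordering" fallback). */
static size_t
char_to_erase (const char *lang, const uint32_t *cluster, size_t len)
{
  assert (len > 0);
  for (size_t i = 0; i < sizeof modules / sizeof modules[0]; i++)
    if (strcmp (modules[i].lang, lang) == 0 && modules[i].calc_char_to_erase)
      return modules[i].calc_char_to_erase (cluster, len);
  return len - 1;
}
```

With this rule, the cluster U+05D1 U+05B8 U+05BC gives index 1 for "he" (the Qamats goes first, in whichever order the marks are stored) and index 2 for a language without a hook.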

Regards,
Dov

On 4/27/07, Dov Grobgeld <dov grobgeld gmail com > wrote:
I was trying to figure out why backspace does not delete the last character (accent) in the buffer when entering Hebrew text with accents, and I stumbled upon the reason in gtk+/gtk/gtktextbuffer.c:gtk_text_buffer_backspace():

      if (backspace_deletes_character)
        {
          gchar *normalized_text = g_utf8_normalize (cluster_text,
                                                     strlen (cluster_text),
                                                     G_NORMALIZE_NFD);
          glong len = g_utf8_strlen (normalized_text, -1);
         
          if (len > 1)
            gtk_text_buffer_insert_interactive (buffer,
                                                &start,
                                                normalized_text,
                                                g_utf8_offset_to_pointer (normalized_text, len - 1) - normalized_text,
                                                default_editable);
         
          g_free (normalized_text);
        }

And there's the crux: why normalize through g_utf8_normalize() at all? If backspace should not simply delete the last character in the buffer, shouldn't its behavior be language dependent, perhaps as part of the Pango language module? In any case, for Hebrew the current behavior is not logical, as there are accents that IMO tie more strongly than others. E.g. when inserting:

   U+05D1 Hebrew Letter Bet
   U+05BC Hebrew Point Dagesh or Mapiq
   U+05B8 Hebrew Point Qamats

the dotting of the Bet (Mapiq) logically ties more strongly than the vowel mark Qamats (to such an extent that fonts often provide a special glyph for the Bet/Mapiq combination), but backspace currently erases the Mapiq first. The reason is probably that normalization reorders the marks by canonical combining class rather than keeping them as typed: U+05BC has class 21 and U+05B8 has class 18, so the Mapiq ends up last in the normalized cluster. This also breaks, e.g., the OpenType table's Bet/Mapiq ligature, as the two characters are no longer adjacent. Of course one may build more sophisticated OpenType tables, but this seems quite roundabout...
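
To see the effect on this particular cluster, here is a minimal sketch in plain C of the canonical reordering step that NFD applies. It is not the GLib implementation: the function names are made up, and the combining classes are hardcoded for the two marks in this example only (18 for U+05B8 and 21 for U+05BC, as listed in the Unicode Character Database). The Hebrew letters here have no canonical decomposition, so for this input reordering is all that NFD does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Canonical combining classes, hardcoded for this example only. */
static int
combining_class (uint32_t cp)
{
  switch (cp)
    {
    case 0x05B8: return 18;   /* HEBREW POINT QAMATS */
    case 0x05BC: return 21;   /* HEBREW POINT DAGESH OR MAPIQ */
    default:     return 0;    /* treat everything else as a starter */
    }
}

/* Stable insertion sort of each run of marks by combining class.
 * Starters (class 0) never move, and marks never cross a starter. */
static void
canonical_reorder (uint32_t *s, size_t len)
{
  assert (s != NULL || len == 0);
  for (size_t i = 1; i < len; i++)
    {
      uint32_t cp = s[i];
      int cc = combining_class (cp);
      size_t j = i;

      if (cc == 0)
        continue;
      while (j > 0 && combining_class (s[j - 1]) > cc)
        {
          s[j] = s[j - 1];
          j--;
        }
      s[j] = cp;
    }
}
```

Feeding it the typed order U+05D1 U+05BC U+05B8 yields U+05D1 U+05B8 U+05BC, so code that drops the last character of the normalized cluster removes the Mapiq rather than the Qamats.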

Regards,
Dov



