Re: UrShapes & UTF 8

From: Cyrille Chepelov <chepelov calixo net>
To: dia-list gnome org
Subject: Re: UrShapes & UTF 8
Date: Sun, 5 Aug 2001 12:26:55 +0200

On Sat, Aug 04, 2001, at 01:27:28 -0400, John Palmieri wrote:

    Hey Cyrille.  I saw your post about the UML changes and it occurred
to me that I'm not doing much in UrShapes to support UTF8.  Actually I
have done nothing because I am not totally sure how Gnome/GTK+/Dia
handles UTF8.  First, can you point me to the docs and second can you 
briefly go over how Dia incorporates UTF8.  I would like to start 
placing support in UrShapes so that I don't have to revisit the problem 
later.


Most objects won't really care. If all you do to strings is g_strdup
(strcpy), strcat, strlen and g_free them, then you don't need to worry about
UTF-8 at all. 

What is taboo is doing stuff like:
        for (char *p=str; *p ; p++) {
          if (*p == 'A') foo();
          if (*p == 'B') bar();
        }

      or 
        if (str[7] == '\\') str[7] = 'k';

When that happens, you need to use libunicode's routines (they all have
unicode_ prefixes -- look them up in <unicode.h>). But please don't use them
directly! Include "charconv.h" and use the variants with the uni_ prefix (these
will do the right thing when dia is compiled on a unicode-less platform,
where they work on local (ASCII+) characters instead).

So, the above code snippets should be converted to something like:
        unichar uc;
        for (char *p=str, *pn=uni_get_utf8(p,&uc);
             *p; 
             p = pn, pn=uni_get_utf8(p,&uc)) {
            if (uc == 'A'/* or a 32-bit UCS-4 constant */) foo();
            if (uc == 'B') bar();
        } /* this uses C99 syntax ; adapt the syntax and/or idiom to 
             C90 for dia */
        
        or

        unichar uc;
        utfchar *s7,*s8;
        s7 = &str[uni_offset_to_index(str,7)];
        s8 = uni_get_utf8(s7,&uc);
        
        if (uc == '\\') {
            utfchar *newstr;
            *s7 = 0;
            newstr = g_strconcat(str,"k",s8,NULL);
            g_free(str);
            str = newstr;
        }

        /* yes, that used to be a one-liner ... */
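
For readers without dia's charconv.h at hand, here is a minimal standalone
sketch of the same idea: walk the string one code point at a time instead of
one byte at a time. The names utf8_next and count_codepoint are hypothetical
(they are not dia or libunicode functions), and the decoder assumes the input
is well-formed UTF-8; it does no real validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a "get next character" routine (NOT dia's
 * uni_get_utf8): decode the code point starting at s into *out and
 * return a pointer to the start of the next sequence.
 * Assumption: input is well-formed UTF-8. An unexpected lead byte is
 * simply returned as a single-byte value. */
static const char *utf8_next(const char *s, uint32_t *out)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t c = p[0];
    int extra = 0;                    /* continuation bytes expected */

    if ((c & 0xE0) == 0xC0)      { c &= 0x1F; extra = 1; }
    else if ((c & 0xF0) == 0xE0) { c &= 0x0F; extra = 2; }
    else if ((c & 0xF8) == 0xF0) { c &= 0x07; extra = 3; }

    p++;
    /* Fold in each 10xxxxxx continuation byte, 6 bits at a time. */
    while (extra-- > 0 && (*p & 0xC0) == 0x80) {
        c = (c << 6) | (uint32_t)(*p & 0x3F);
        p++;
    }
    *out = c;
    return (const char *)p;
}

/* Count how many times the code point `wanted` occurs in str. */
static size_t count_codepoint(const char *str, uint32_t wanted)
{
    size_t n = 0;
    uint32_t uc;

    assert(str != NULL);
    while (*str) {
        str = utf8_next(str, &uc);
        if (uc == wanted)
            n++;
    }
    return n;
}
```

With this, count_codepoint("m\xc3\xa9lange", 0xE9) finds the e-acute once,
while searching for 0xA9 finds nothing. A byte-by-byte loop would have
matched the 0xA9 byte in the middle of the two-byte sequence, which is
exactly the kind of bug the per-byte idiom invites.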

Also, a few details. The sources *must*, for the time being, remain pure
ASCII files (we might relax the rule sometime, when UTF-8 editors and
consoles become the norm, rather than the painful exception). This applies
to string constants too ! These constants must be encoded as UTF-8 strings,
with proper C escaping so they are expressed in pure ASCII:
        "mélange" becomes  "m\xc3\xa9lange", not "mÃ©lange"
(do an `echo "string" | iconv -t utf-8 | od -a -t x1` to get the
         string's value)
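
A small sketch of what such an escaped constant looks like from C. The
variable and function names are made up for illustration. It also shows one
pitfall worth knowing: a \x escape swallows every hex digit that follows it,
so when the next character of the text is one of 0-9, a-f or A-F, the literal
has to be split in two.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "melange" with an e-acute, written as pure ASCII source. */
static const char melange[] = "m\xc3\xa9lange";

/* "nee" with an e-acute in the middle: "n\xc3\xa9e" would be parsed as
 * the single (out-of-range) escape \xa9e, so the literal is split and
 * the compiler concatenates the two pieces. */
static const char nee[] = "n\xc3\xa9" "e";

/* Count characters rather than bytes by skipping UTF-8 continuation
 * bytes (those of the form 10xxxxxx). Assumes well-formed UTF-8. */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

Here strlen(melange) is 8 (bytes) while utf8_char_count(melange) is 7
(characters), which is why strlen is fine for allocation and copying but not
for anything that counts or indexes characters.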

Now, on to how I plan to move dia to UTF-8. It's going to happen in three
stages. 
        Phase 0 is what we have now: the core (core proper + binary
objects), and most libraries (including GTK) talk local character set. Some
modules are already talking UTF-8 internally, but it's their business;
interfaces still talk local (in the diagram, when a module overlaps another,
it's supposed to mean that the charset used in that interface is defined by
the overlapping module. White modules talk local, gray ones talk UTF-8).

As I review code for UTF8-ness (and StdProp-ness), I put (but do not test)
UTF-8 equivalents of the non-clean code in UNICODE_WORK_IN_PROGRESS blocks.
When I begin reviewing what's in app/, I'll also define another symbol,
GTK_TALKS_UTF8. It's still unclear to me whether this is the case with the
version of gtk used on Windows (Hans ?), but with gtk 1.x on X, it's not the
case.

When I find individual modules which could go UTF-8 without imposing changes
on the rest of the code, I convert them right away when it's sensible to do
so (this happened yesterday for the "stereotype" module, which is a part of
UML support. The objects themselves still see local strings).

The code under (defined(UNICODE_WORK_IN_PROGRESS) && !defined(GTK_TALKS_UTF8))
is there for the period when there's a charset impedance mismatch between the
code and GTK. This should be a short period of time, but during it we'll
have to pay an iconv() step for all data from/to entry widgets.
(Actually, I'll define new symbols like GTK_TALKS_UTF8_WE_DONT,
GTK_DOESNT_TALK_UTF8_WE_DO and GTK_CHARSET_MISMATCH to make the situation clear.
This will appear in config.h where it belongs, through configure.in games).

For UrShapes (by the way, what does "Ur" mean?), I think you should either
manage two code paths (one for UNICODE_WORK_IN_PROGRESS and the current one),
or decide that your code is already using UTF-8, and then do a local/UTF8
conversion at your boundary with dia (if !defined(UNICODE_WORK_IN_PROGRESS))
and one adaptation step with (!defined(GTK_TALKS_UTF8)) when you're dealing
with GTK. (If you're not talking UTF-8 in your module, you still have to do
a pair of GTK_TALKS_UTF8_WE_DONT and GTK_DOESNT_TALK_UTF8_WE_DO steps when
you talk to GTK, so I definitely advise you to talk UTF-8 or be charset
agnostic...)
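
To make the "conversion at your boundary" idea concrete, here is a rough
sketch of one direction of such a step. It is not dia code: the function name
latin1_to_utf8 is hypothetical, and it assumes the local character set is
ISO-8859-1, which lets the conversion be done by hand. A real boundary would
go through iconv() (as mentioned above) so that any local charset works.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical boundary helper: convert a local string to UTF-8 before
 * handing it to a module that talks UTF-8 internally.
 * Assumption: "local" means ISO-8859-1 here; real code would use iconv().
 * Returns a newly allocated string (caller frees), or NULL on failure. */
static char *latin1_to_utf8(const char *local)
{
    const unsigned char *p;
    char *out, *o;

    assert(local != NULL);

    /* Each Latin-1 byte becomes at most two UTF-8 bytes. */
    out = malloc(2 * strlen(local) + 1);
    if (out == NULL)
        return NULL;

    o = out;
    for (p = (const unsigned char *)local; *p; p++) {
        if (*p < 0x80) {
            *o++ = (char)*p;                    /* plain ASCII: unchanged */
        } else {
            *o++ = (char)(0xC0 | (*p >> 6));    /* lead byte: 110000xx */
            *o++ = (char)(0x80 | (*p & 0x3F));  /* continuation: 10xxxxxx */
        }
    }
    *o = '\0';
    return out;
}
```

A module that decides to be UTF-8 inside would call something like this on
every string coming in from the non-UTF-8 side, and the inverse on the way
out, keeping the two conversions in one place rather than scattered through
the code.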

-----------

Phase I begins the day UNICODE_WORK_IN_PROGRESS is enabled by default (and
finally tested. I'm not even sure it compiles today...). Once the dust
settles, I'll start removing the non-UNICODE_WORK_IN_PROGRESS code path, so at
the end of phase I, we won't have the option of running without unicode
(except by running without HAVE_UNICODE). At that time, running without
unicode will be somewhere between unsupported, frowned upon, and disabled.

During phase I, most of the code will run using UTF-8 strings ; notable
exceptions will be GTK 1.x (still the usual mess), and libxml1 (we'll use
the libxml2 parser of recent libxml1, but the write path is still the old
one, which for legacy support reasons will still be used in local
character set).

---------------
Phase II begins when we switch to GTK 2.0 (or together with phase I on
Windows...). We finally dump libxml1. Everything talks UTF-8, charconv.h
becomes practically empty, and building with Unicode support is mandatory.
The UTF-8 part of that switch will be pretty much painless (mostly remove
the "impedance mismatch" support), but there's going to be a lot of other
changes (because the GTK2 API is different from the GTK1 one).

Once we switch to gtk2 (or while we do), we'll be able to (or we'll have to)
switch to using GObject in objects. I have no details on that part, and it
is rather orthogonal to the changes related to UTF-8.

---------------

Will the above plan be followed, and how will it relate to releases? Well,
it seems that James currently has no time to make releases. I have no
visibility after August 20th (new employer; I may face all kinds of
situations after that date, ranging from still having some time and energy
to play with dia, to having absolutely no time). So, there are a lot of
uncertainties.

        -- Cyrille

-- 
Grumpf.

Attachment: dia-transition.dia
Description: Binary data

References:
- UrShapes & UTF 8
  - From: John Palmieri
