Re: CORBA_char vs CORBA_wchar

From: Bill Haneman <Bill Haneman Sun COM>
To: Bill Haneman Sun COM, michael ximian com
Cc: orbit-list gnome org
Subject: Re: CORBA_char vs CORBA_wchar
Date: Fri, 20 Jul 2001 11:56:24 +0100 (BST)

Hi list:

I am forwarding a reply to a conversation between Michael and myself.  
The short background is this:

The accessibility SPI needs to deal in internationalized strings.
However, Michael believes ORBit's wchar/wstring may be broken; at the
least it is untested, and it is inconvenient given that the strings
Gnome uses internally are UTF-8.

So Michael suggests using CORBA_string and CORBA_char.  This works fine
for ORBit, since the strings are all UTF-8 anyhow and the marshalling
process doesn't know anything about the encoding.  But the accessibility
SPI needs to interoperate with accessible Java apps and the Java ORB, and
Java uses UCS-2.  My concern is that the use of 'char' and 'string' for
UTF-8 is nonstandard in CORBA, and there is no guarantee that the Java
ORB will correctly convert between UTF-8 and UCS-2 internally when
transmitting and receiving (non-wide) strings over the wire protocol.

Does anyone out there have any hard info about this, or about whether 
the use of CORBA 'string' for UTF-8 is nonstandard/broken/dangerous?

Thanks,

Bill

>Hi Bill,
>
>On Thu, 19 Jul 2001, Bill Haneman wrote:
>> right...  however I am still a little concerned about the interaction  
>> with CORBA environments and ORBs that don't use UTF8, like Java.
>
>        I don't understand the concern at all. The fact that a string is
>encoded in UTF-8 is totally transparent to Java - a UTF-8 string is
>indistinguishable from any 8bit string - and I'm sure Java supports them.

Hi Michael:

I think this is where our understandings diverge.  If you give a UTF-8 
string to Java it won't know what to make of it unless you explicitly 
tell it to use the UTF-8 encode/decode functions.  So in that sense it 
is not 'transparent'.
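
To make the distinction concrete, here is a minimal sketch in Python
(used only for brevity; it does not model any real ORB, and the sample
string is made up).  It decodes the same UTF-8 bytes two ways: once
byte-per-character, as a receiver assuming a plain 8-bit charset such as
ISO 8859-1 might, and once as UTF-8.

```python
# Hypothetical sample text containing one non-ASCII character (e-acute).
text = "caf\u00e9"

# What a UTF-8 sender would put on the wire: 5 bytes for 4 characters.
wire = text.encode("utf-8")

# A receiver that assumes one byte == one character.  ISO 8859-1 is an
# assumption here, standing in for "some ordinary 8-bit charset".
naive = wire.decode("latin-1")

# A receiver that has been told explicitly that the bytes are UTF-8.
aware = wire.decode("utf-8")

print(len(wire), repr(naive), repr(aware))
```

The naive receiver ends up with five characters (the e-acute turns into
two unrelated Latin-1 characters), while the UTF-8-aware receiver gets
the original four back.  Nothing is lost at the byte level, but the
receiver has to know which decoding to apply.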

In the Java-to-IDL bindings spec (and vice versa, from IDL to Java) all
Java chars/strings are wchars/wstrings.  That means that both wstring
and string (from IDL) are stored as java.lang.String, but I think the
marshalling is different, and that's an area where documentation is
pretty sparse.  It is not clear that a UTF-8 string, passed to the Java
ORB via "string" in the IDL, will properly be converted to UCS-2.

At any rate, the CORBA docs I have (though I have not exhaustively read
the OMG specs :-P ) unanimously indicate that Unicode/internationalized
stuff should use wstring/wchar.

I agree with you that in the case of ORBit the UTF-8 can be seen to work 
transparently, no problem.  But since there are potentially many 
different character encodings used by ORBs and language bindings (such 
as the Java UCS-2 example) it is not clear that the marshalling will 
correctly do the character conversions if the spec is silent on the 
UTF-8 issue.  If the marshalling and demarshalling routines don't know 
to look for UTF-8, how can they be expected to work?

>> Not yet, I am still using ORBit-Martin-forked and so had to hack
>> Makefile.am a little (to include ORBitCosNaming-2 and ORBitutil-2   
>> which are separate in O-m-f:  maybe my .pc file is not right?).
>
>        Hmm, if you upgrade to ORBit2 you want to kill the ORBitutil
>library, it really screwed me up.

Ya, but without it the ORBit-2 libs in ORBit-Martin-forked don't resolve 
all the symbols needed.  So in this respect ORBit-2 and O-m-f seem 
incompatible.

>>  I will put it in CVS ("at-spi") once I have a reasonably portable
>> Makefile.am.  ORBit-2 is not building on Solaris yet, but the other
>> option is for me to pull ORBit-2 onto my linux box, which would fix
>> Makefile.am, and upload it to CVS sooner (today or tomorrow).
>
>        Sounds great - there's no real need for it to work to put it in
>CVS, I'd just whack it in and watch it get fixed :-)

<grin>

>> I have confirmed however that it is not standard to assume that
>> "string"  in CORBA IDL is potentially UTF-8, and in fact this will      
>> break for many ORBs.
>  
>        I just don't believe this whatsoever I'm afraid. UTF-8 is
>indistinguishable from a std. 8 bit charset. That is unless you're telling
>me that a CORBA string is only 7bit ASCII ? - hmm.

I think that would be an odd implementation indeed.  My concern is more 
that the conversion from the 8-bit CORBA wire protocol to some non-UTF-8 
character encoding (like Java's) will not work right for UTF-8 'wide' 
chars.
  
>>  When using UTF-8 one is supposed to use wstring.
>  
>        If you put utf-8 in a wstring, you simply waste every other byte,
>if you convert it to UCS-2 you lose information and increase size most
>likely.

One does what one has to do.  Unless CORBA says 'chars/strings are UTF-8 
aware' then one has no choice if the string data can contain multibyte 
chars.
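
For a rough sense of the sizes being argued about, here is a small
Python sketch.  It is an illustration only: the function name `sizes`
is made up, real wstring marshalling adds length prefixes and byte-order
handling that are ignored here, and UTF-16 (big-endian, no BOM) is used
as a stand-in for UCS-2, which matches it for the sample text used.

```python
def sizes(text):
    """Return byte counts for three ways of carrying `text`.

    (utf8, ucs2, padded) where
      utf8   = the string encoded as UTF-8,
      ucs2   = the string converted to 16-bit code units,
      padded = each UTF-8 byte stuffed into its own 16-bit unit.
    """
    utf8 = text.encode("utf-8")
    ucs2 = text.encode("utf-16-be")
    # "UTF-8 in a wstring": every byte is widened with a zero high byte.
    padded = b"".join(bytes([0, b]) for b in utf8)
    return len(utf8), len(ucs2), len(padded)


print(sizes("hello"))
print(sizes("\u65e5\u672c\u8a9e"))  # three CJK characters
```

For the ASCII sample this gives (5, 10, 10); for the three CJK
characters it gives (9, 6, 18).  So the 16-bit form doubles the size of
ASCII-heavy text, but it is not larger in every case, and putting raw
UTF-8 bytes into wide chars is the most wasteful of the three.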

>>  If we use "string" with the Java ORB, and then call "read/write UTF8"
>> to do the conversions to and from the Java UCS-2 strings, it may or
>> may not work - certainly it would be a hack.
>  
>        This I think is more a function of the Java ORB converting all
>strings to UCS-2 - I'd prefer to isolate this horrible inefficiency inside
>Java's in-process string handling, than push huge chunks of wasted space
>across the CORBA wire for every string.

Hmm, it may be inefficient, but again, if CORBA/OMG is silent on UTF-8
awareness then I don't think we have a choice!  Using "string" would be
fine if we knew that the ORB's conversion from the CORBA 8bit wire
protocol was correct for UTF-8 even in environments that use other
character sets internally.
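
The dependency can be sketched as follows, again in Python and again as
a model only.  The helpers `widen` and `recover` and the `charset`
parameter are hypothetical; they stand in for whatever a given ORB
assumes about narrow strings when it builds its native wide string, and
are not taken from any real ORB API.

```python
def widen(wire, charset):
    """Model an ORB turning narrow-string bytes into a native string,
    assuming the bytes are in `charset` (a hypothetical knob)."""
    return wire.decode(charset)


def recover(received, charset):
    """The 'hack': undo the widening to get the raw bytes back,
    then decode those bytes as UTF-8."""
    return received.encode(charset).decode("utf-8")


original = "na\u00efve \u65e5\u672c"
wire = original.encode("utf-8")

# Case 1: the conversion is byte-transparent (ISO 8859-1 maps every
# byte value to a character), so the round trip gets the text back.
round_trip = recover(widen(wire, "latin-1"), "latin-1")

# Case 2: the conversion only accepts 7-bit ASCII, so any multibyte
# UTF-8 sequence is rejected before the application ever sees it.
try:
    widen(wire, "ascii")
    ascii_accepted = True
except UnicodeDecodeError:
    ascii_accepted = False

print(round_trip == original, ascii_accepted)
```

In this model the hack works only when the narrow-string conversion
passes all 256 byte values through unchanged.  Whether a particular ORB
behaves like case 1, case 2, or something else is exactly the open
question.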

>> My conclusion so far is that if we go with "string" in the IDL it will
>> be broken and have to be changed, but we should not go changing IDL in
>> that significant a way "after the fact".  Ugh.
>
>        I really don't believe it is broken - _really_, I simply can't
>understand any possible way in which it can go wrong reliably. The worst
>case scenario is that in Java land you have the slightly confusing
>scenario of having a wastefully large UCS-2 string with a UTF-8 string as
>every other byte.
>
>        Contrast this with the evils of in every C program, having to
>convert perfectly good UTF-8 strings to UCS-2 + extra alloc / frees per
>call, and then back again at the other end. It seems grotesque.
>
>        Surely you can't be serious ? 8) but maybe we should take this to
>ORBit-list, and get some input on how UTF-8 will break many ORBs.
>

I'm afraid I am serious.  If someone can put my concerns to rest
regarding UTF-8 marshalling *generally* for ORBs, then I would be very
happy.  I agree that using UTF-8 in the servants is much better than
doing wchar conversions for every string, if we can get away with it.

-Bill

>	Regards,
>
>		Michael.
>
>-- 
> mmeeks@gnu.org  <><, Pseudo Engineer, itinerant idiot
>

------
Bill Haneman x19279
Gnome Accessibility / Batik SVG Toolkit
Sun Microsystems Ireland