Re: CORBA_char vs CORBA_wchar
- From: Bill Haneman <Bill Haneman Sun COM>
- To: Bill Haneman Sun COM, michael ximian com
- Cc: orbit-list gnome org
- Subject: Re: CORBA_char vs CORBA_wchar
- Date: Fri, 20 Jul 2001 11:56:24 +0100 (BST)
Hi list:
I am forwarding a reply to a conversation between Michael and myself.
The short background is this:
The accessibility SPI needs to deal in internationalized strings.
However, Michael believes ORBit's wchar/wstring may be broken; at the
least it is untested, and it is inconvenient when one considers that the
strings Gnome uses internally are UTF-8.
So Michael suggests using CORBA_string and CORBA_char. This works fine
for ORBit since the strings are all UTF-8 anyhow, and the marshalling
process doesn't know anything about this. But the accessibility SPI
needs to interoperate with accessible Java apps and the Java ORB, and
Java uses UCS-2. My concern is that the use of 'char' and 'string' for
UTF-8 is nonstandard in CORBA and there is no guarantee that the Java
ORB will correctly convert between UTF-8 and UCS-2 internally when
transmitting and receiving (non-wide) strings over the wire protocol.
Does anyone out there have any hard info about this, or about whether
the use of CORBA 'string' for UTF-8 is nonstandard/broken/dangerous?
Thanks,
Bill
>Hi Bill,
>
>On Thu, 19 Jul 2001, Bill Haneman wrote:
>> right... however I am still a little concerned about the interaction
>> with CORBA environments and ORBs that don't use UTF8, like Java.
>
> I don't understand the concern at all. The fact that a string is
>encoded in UTF-8 is totally transparent to Java - a utf-8 string is
>indistinguishable from any 8bit string - and I'm sure Java supports
>them.
Hi Michael:
I think this is where our understandings diverge. If you give a UTF-8
string to Java it won't know what to make of it unless you explicitly
tell it to use the UTF-8 encode/decode functions. So in that sense it
is not 'transparent'.
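To make the 'not transparent' point concrete, here is a minimal sketch of what happens at the Java String level only. It does not model any ORB's marshalling; the class and method names are made up for illustration, and it assumes a modern JDK (java.nio.charset.StandardCharsets) with ISO-8859-1 standing in for "any 8-bit charset".

```java
import java.nio.charset.StandardCharsets;

class Utf8AsStringDemo {
    // Bytes as a UTF-8 client would place them in a CORBA "string" (hypothetical helper).
    static byte[] wireBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // A receiver that assumes a single-byte charset: one char per byte.
    static String decodeNaively(byte[] wire) {
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // A receiver that has been told explicitly that the payload is UTF-8.
    static String decodeAsUtf8(byte[] wire) {
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";          // 4 characters; the last is U+00E9
        byte[] wire = wireBytes(original);      // U+00E9 becomes the two bytes 0xC3 0xA9

        System.out.println(wire.length);                          // 5
        System.out.println(decodeNaively(wire).length());         // 5: two chars where one was meant
        System.out.println(decodeAsUtf8(wire).equals(original));  // true
    }
}
```

The naive decode does not fail; it silently yields a different five-character string, which is why the result depends on what the receiving side assumes about the bytes.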
In the Java-IDL bindings spec (and vice-versa, from IDL to Java) all
Java chars/strings are wchars/wstrings. On the one hand that means that
both wstring and string (from IDL) are stored as java.lang.String - but
I think the marshalling is different, and that's an area where
documentation is pretty sparse. It is not clear that a UTF-8 string,
passed to the Java ORB via "string" in the IDL, will properly be
converted to UCS-2.
At any rate the CORBA docs I have (though I have not exhaustively read
the OMG specs :-P ) unanimously indicate that unicode/internationalized
stuff should use wstring/wchar.
I agree with you that in the case of ORBit the UTF-8 can be seen to work
transparently, no problem. But since there are potentially many
different character encodings used by ORBs and language bindings (such
as the Java UCS-2 example) it is not clear that the marshalling will
correctly do the character conversions if the spec is silent on the
UTF-8 issue. If the marshalling and demarshalling routines don't know
to look for UTF-8, how can they be expected to work?
>> Not yet, I am still using ORBit-Martin-forked and so had to hack
>> Makefile.am a little (to include ORBitCosNaming-2 and ORBitutil-2
>> which are separate in O-m-f: maybe my .pc file is not right?).
>
> Hmm, if you upgrade to ORBit2 you want to kill the ORBitutil
>library, it really screwed me up.
Ya, but without it the ORBit-2 libs in ORBit-Martin-forked don't resolve
all the symbols needed. So in this respect ORBit-2 and O-m-f seem
incompatible.
>> I will put it in CVS ("at-spi") once I have a reasonably portable
>> Makefile.am. ORBit-2 is not building on Solaris yet, but the other
>> option is for me to pull ORBit-2 onto my linux box, which would fix
>> Makefile.am, and upload it to CVS sooner (today or tomorrow).
>
> Sounds great - there's no real need for it to work to put it in
>CVS, I'd just whack it in and watch it get fixed :-)
<grin>
>> I have confirmed however that it is not standard to assume that
>> "string" in CORBA IDL is potentially UTF-8, and in fact this will
>> break for many ORBs.
>
> I just don't believe this whatsoever I'm afraid. UTF-8 is
>indistinguishable from a std. 8 bit charset. That is unless you're telling
>me that a CORBA string is only 7bit ASCII ? - hmm.
I think that would be an odd implementation indeed. My concern is more
that the conversion from the 8-bit CORBA wire protocol to some non-UTF-8
character encoding (like Java's) will not work right for UTF-8 'wide'
chars.
>> When using UTF-8 one is supposed to use wstring.
>
> If you put utf-8 in a wstring, you simply waste every other byte,
>if you convert it to UCS-2 you lose information and increase size most
>likely.
One does what one has to do. Unless CORBA says 'chars/strings are UTF-8
aware' then one has no choice if the string data can contain multibyte
chars.
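For a rough sense of the size trade-off being argued here, the sketch below compares encoded byte counts for the same text. It is only an approximation under stated assumptions: UTF-16BE is used as a stand-in for UCS-2 (the two agree for characters in the Basic Multilingual Plane), CORBA length prefixes, terminators and padding are ignored, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizeDemo {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when encoded as 16-bit units (UTF-16BE, no byte-order mark).
    static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String cjk = "\u65e5\u672c\u8a9e";   // three CJK characters

        System.out.println(utf8Size(ascii) + " vs " + utf16Size(ascii)); // 5 vs 10
        System.out.println(utf8Size(cjk) + " vs " + utf16Size(cjk));     // 9 vs 6
    }
}
```

So the 16-bit form doubles ASCII-heavy text but is smaller for CJK text; which one "wastes" space depends on the strings actually sent.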
>> If we use "string" with the Java ORB, and then call "read/write UTF8"
>> to do the conversions to and from the Java UCS-2 strings, it may or
>> may not work - certainly it would be a hack.
>
> This I think is more a function of the Java ORB converting all
>strings to UCS-2 - I'd prefer to isolate this horrible inefficiency inside
>Java's in-process string handling, than push huge chunks of wasted space
>across the CORBA wire for every string.
Hmm, it may be inefficient but again, if CORBA/OMG is silent on UTF-8
awareness then I don't think we have a choice! This would be fine if we
knew that the ORB's conversion from CORBA 8bit wire protocol was correct
for UTF-8 even in environments that use other character sets internally.
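The "may or may not work" worry about converting after the fact can be sketched as follows. This is an illustration at the String level under two assumed receiver behaviours, not a description of what any particular Java ORB does; ISO-8859-1 and US-ASCII are chosen only as examples of a byte-preserving and a non-byte-preserving decode, and the names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Undo a one-char-per-byte (Latin-1) decode, then reinterpret the bytes as UTF-8.
    static String recoverFromLatin1(String received) {
        byte[] raw = received.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        // Case 1: the receiver mapped every byte to one char, so the bytes survive.
        String viaLatin1 = new String(wire, StandardCharsets.ISO_8859_1);
        System.out.println(recoverFromLatin1(viaLatin1).equals(original)); // true

        // Case 2: the receiver treated the data as 7-bit ASCII; bytes >= 0x80
        // were replaced during decoding, so the original cannot be recovered.
        String viaAscii = new String(wire, StandardCharsets.US_ASCII);
        System.out.println(recoverFromLatin1(viaAscii).equals(original));  // false
    }
}
```

The hack is lossless only if the receiving side preserves every byte value, which is exactly the guarantee in question.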
>> My conclusion so far is that if we go with "string" in the IDL it will
>> be broken and have to be changed, but we should not go changing IDL in
>> that significant a way "after the fact". Ugh.
>
> I really don't believe it is broken - _really_, I simply can't
>understand any possible way in which it can go wrong reliably. The worst
>case scenario is that in Java land you have the slightly confusing
>scenario of having a wastefully large UCS-2 string with a UTF-8 string as
>every other byte.
>
> Contrast this with the evils of in every C program, having to
>convert perfectly good UTF-8 strings to UCS-2 + extra alloc / frees per
>call, and then back again at the other end. It seems grotesque.
>
> Surely you can't be serious ? 8) but maybe we should take this to
>ORBit-list, and get some input on how UTF-8 will break many ORBs.
>
I'm afraid I am serious. If someone can put my concerns at rest
regarding UTF-8 marshalling *generally* for ORBs, then I would be very
happy. I agree that using UTF-8 in the servants is much better than
doing wchar conversions for every string, if we can get away with it.
-Bill
> Regards,
>
> Michael.
>
>--
> mmeeks@gnu.org <><, Pseudo Engineer, itinerant idiot
>
------
Bill Haneman x19279
Gnome Accessibility / Batik SVG Toolkit
Sun Microsystems Ireland