Re: [Re: [Re: [libxml++] UTF8 support]]



Rasmus Kaj <kaj e kth se> wrote:
> 
> This discussion seems to have gone a bit out of hand, but anyway,
> here's my view of it.  These are the main features I want regarding
> character conversion:
> 
>  1. The library should efficiently cope with any legal XML.

libxml should do that for us.
 
>  2. I want to be able to get a std::string (with a specified encoding)
>     of a value.  Most of the XML I will handle will be ISO8859-15
>     (that's just me, but replace the encoding with other encodings and
>     a lot will be covered).

Glib::ustring would do that. There are implicit conversions between
Glib::ustring and std::string so you can use a std::string wherever you use a
Glib::ustring. ASCII is a subset of UTF8 so a std::string will be fine as long
as you don't try to do any character-wise operations on a std::string that
actually contains non-ASCII UTF8 data.
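
To make the "character-wise operations" caveat concrete, here is a minimal sketch in plain C++ (no glibmm needed). The helper name utf8_length is made up for illustration and it assumes the input is valid UTF8. It shows that std::string::size() counts bytes, while the number of characters has to be worked out by skipping continuation bytes:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of libxml++ or glibmm.
// Counts Unicode code points in a UTF-8 encoded std::string by skipping
// continuation bytes (bit pattern 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8_length(const std::string& s)
{
  std::size_t count = 0;
  for (unsigned char c : s)
  {
    if ((c & 0xC0) != 0x80)  // every byte that is not a continuation byte starts a character
      ++count;
  }
  return count;
}
```

For pure ASCII the two numbers agree, which is why a plain std::string is safe there. For a string holding U+00E9 (the two bytes 0xC3 0xA9), size() reports 2 while utf8_length() reports 1, and indexing with operator[] would land in the middle of the character.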

>  3. When necessary, I want to be able to get a std::wstring (a
>     std::string of wide characters).

libxml uses UTF8 (variable numbers of bytes per character - just 1 byte if
it's ASCII). I'm not sure whether a wstring (2 bytes, I suppose) is even
capable of encoding all possible UTF8 text. I think there are some languages
that a wstring can't cope with. I'm not sure.

If a wstring can cope with it, it's almost certainly not simple to do a
conversion between a UTF8 string and a wstring:
1. There might be a question of the encoding of that wstring (UCS2
maybe?).
2. And even if it can be done theoretically it would probably require us to
depend on a conversion library such as iconv. That might or might not be a big
deal.
3. The UTF8<->wstring conversion would probably not be efficient. It might be
so bad that it's worth using a parser that does wstring natively. libxml is
all UTF8 all of the time.
4. I believe that almost all UNIX developers will be happy with using UTF8. We
_might_ decide that Windows programmers aren't important to us.
I'm not sure what the Unicode situation is with MacOS X.
 
I have tried to make it clear above that I don't really know what I'm talking
about, but that I am aware that certain problems might exist. Please don't
flame me for not knowing everything exactly and needing people to discuss
them.
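
To give the discussion something concrete to poke at, here is a rough sketch of a hand-written UTF8 to std::wstring decoder. It is only an illustration under stated assumptions: utf8_to_wstring is a made-up name, not a libxml or libxml++ function, and it does not reject overlong encodings or encoded surrogates. The width of wchar_t is implementation-defined (commonly 16 bits on Windows and 32 bits on Linux), so the sketch emits a surrogate pair when wchar_t is too narrow for the code point:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of libxml or libxml++.
// Decodes UTF-8 into a std::wstring, one code point at a time.
std::wstring utf8_to_wstring(const std::string& in)
{
  std::wstring out;
  std::size_t i = 0;
  while (i < in.size())
  {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    char32_t cp = 0;
    std::size_t extra = 0;  // number of continuation bytes that follow the lead byte

    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else throw std::invalid_argument("invalid UTF-8 lead byte");

    if (extra > in.size() - i - 1)
      throw std::invalid_argument("truncated UTF-8 sequence");

    for (std::size_t k = 1; k <= extra; ++k)
    {
      const unsigned char c = static_cast<unsigned char>(in[i + k]);
      if ((c & 0xC0) != 0x80)
        throw std::invalid_argument("invalid UTF-8 continuation byte");
      cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
    }
    i += extra + 1;

    if (sizeof(wchar_t) >= 4 || cp < 0x10000)
    {
      out.push_back(static_cast<wchar_t>(cp));
    }
    else
    {
      // 16-bit wchar_t: encode as a UTF-16 surrogate pair.
      cp -= 0x10000;
      out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
    }
  }
  return out;
}
```

The loop is a single pass over the bytes, so the cost is linear in the length of the string. Whether that is acceptable when it has to happen on every get_ and set_ call is a separate question.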

>  I want to be able to handle
>     individual characters in the string, and I want to use standard
>     C++.  Then this is the "correct" way of handling unicode strings.

Glib::ustring tries to be as much as possible like std::string while dealing
with UTF8 text instead of just ASCII. 

>  4. Sometimes, it would probably be nice to get utf8 data as well. 
> The ordering of the points reflects their relative importance to me
> personally, but I believe the following conclusions to hold regardless
> of how the points are reordered:
> 
> Point one suggests that the library should use utf8 internally.  I
> don't care much how this is done, but some kind of refptr<char*> seems
> relatively sane while putting utf8 data in a std::string feels bad.

Our internals are done by libxml. libxml++ is a wrapper of libxml. That won't
change. 

> Points 2 - 4 suggest that the get_ / set_ string methods should be
> templated on string class and conversion method.  The alternative is
> that I would have to call a converter at every call to a get_ / set_
> string method, meaning in practice that I would have to write a
> wrapper around libxml++ (which would feel ridiculous, since libxml++
> itself is a wrapper around libxml).

No, because Glib::ustring has implicit conversions to and from std::string. We
didn't want to make life difficult if you don't want to use
internationalization.

> Note that this _doesn't_ imply that the entire libxml++ would go from a
> "dynamic library" to a "template library".  Just that some of the
> methods would be templates.

I think it would affect almost all methods though.
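
For reference, a templated accessor of the kind being proposed might look roughly like the sketch below. Everything in it is hypothetical: Node is a stand-in class rather than the real libxml++ node, get_value is not an existing method, and Utf8ToLatin1 is a toy converter that assumes valid UTF8 and handles ISO 8859-1 only (not ISO8859-15, which differs in a few characters):

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a libxml++ node; the real class is not templated.
class Node
{
public:
  explicit Node(const std::string& utf8_value) : value_(utf8_value) {}

  // The caller chooses the result string type and supplies the conversion.
  template <class String, class Converter>
  String get_value(Converter convert) const
  {
    return convert(value_);
  }

private:
  std::string value_;  // UTF-8 bytes, as libxml hands them back
};

// Toy converter: UTF-8 to ISO 8859-1. Characters above U+00FF become '?'.
struct Utf8ToLatin1
{
  std::string operator()(const std::string& utf8) const
  {
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
      const unsigned char c = static_cast<unsigned char>(utf8[i]);
      if (c < 0x80)
      {
        out.push_back(static_cast<char>(c));  // ASCII passes through
      }
      else if ((c == 0xC2 || c == 0xC3) && i + 1 < utf8.size())
      {
        // Two-byte sequence for U+0080..U+00FF: combine the payload bits.
        const unsigned char next = static_cast<unsigned char>(utf8[++i]);
        out.push_back(static_cast<char>(((c & 0x03) << 6) | (next & 0x3F)));
      }
      else if ((c & 0xC0) != 0x80)
      {
        out.push_back('?');  // lead byte of a character outside Latin-1
      }
      // Continuation bytes of an unrepresentable character are skipped.
    }
    return out;
  }
};
```

A call would then read node.get_value<std::string>(Utf8ToLatin1()). Every method that takes or returns text would need the same treatment, which is the concern about it affecting almost all methods.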

>  Also, "common instantiations" of a
> template method can - at least theoretically - be included in a
> dynamic library.

Not without depending on all those possible client libraries. Depending on
both glibmm and Qt would be worse than just one.
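
As a small illustration of what "common instantiations" in a library would mean, here is a sketch using explicit template instantiation. The function from_ascii is invented for the example and only handles ASCII input. The two instantiations shown need nothing beyond the standard library:

```cpp
#include <string>

// Hypothetical conversion template that would live in a library header.
// Only correct for ASCII input: each char is widened to the target character type.
template <class String>
String from_ascii(const std::string& in)
{
  return String(in.begin(), in.end());
}

// Explicit instantiations. Placed in one of the library's source files,
// these lines compile the code for these two types into the library itself.
template std::string  from_ascii<std::string>(const std::string&);
template std::wstring from_ascii<std::wstring>(const std::string&);
```

Adding a similar line for Glib::ustring or QString would mean including that library's headers when building libxml++, and most likely linking against it too, which is the dependency problem described above.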

Thanks for the input.


Murray Cumming
murrayc usa net
www.murrayc.com
