Re: [Re: [libxml++] UTF8 support]



Hi Rasmus,

you are listing important requirements which, however, apply only
partly to libxml++ (the other half applies to the unicode library).

Rasmus Kaj wrote:

 1. The library should efficiently cope with any legal XML.

agreed.

 2. I want to be able to get a std::string (with a specified encoding)
    of a value.  Most of the XML I will handle will be ISO8859-15
    (that's just me, but replace the encoding with other encodings and
    a lot will be covered).

That is a requirement for the unicode lib: libxml2 uses one particular
encoding internally (utf8), which you want to be able to 'transcode'
into another.
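
To make 'transcode' concrete, here is a minimal sketch of going from utf8
to ISO8859-15 by hand. The names to_latin9 and utf8_to_latin9 are made up
for this mail, the input is assumed to be well-formed utf8, and characters
that have no ISO8859-15 equivalent simply raise an exception. A real
unicode library (or iconv) would be the sensible choice in practice.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Maps one code point to its ISO8859-15 byte. ISO8859-15 equals
// ISO8859-1 except at eight positions, which are handled explicitly.
static unsigned char to_latin9(std::uint32_t cp)
{
    switch (cp) {
    case 0x20AC: return 0xA4;  // euro sign
    case 0x0160: return 0xA6;  // S with caron
    case 0x0161: return 0xA8;  // s with caron
    case 0x017D: return 0xB4;  // Z with caron
    case 0x017E: return 0xB8;  // z with caron
    case 0x0152: return 0xBC;  // OE ligature
    case 0x0153: return 0xBD;  // oe ligature
    case 0x0178: return 0xBE;  // Y with diaeresis
    // The ISO8859-1 characters displaced by the eight above.
    case 0xA4: case 0xA6: case 0xA8: case 0xB4:
    case 0xB8: case 0xBC: case 0xBD: case 0xBE:
        throw std::runtime_error("not representable in ISO8859-15");
    }
    if (cp > 0xFF)
        throw std::runtime_error("not representable in ISO8859-15");
    return static_cast<unsigned char>(cp);
}

// Decodes utf8 code point by code point and re-encodes each one.
std::string utf8_to_latin9(const std::string& utf8)
{
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        // Sequence length from the lead byte (input assumed well-formed).
        const std::size_t len = c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
        if (i + len > utf8.size())
            throw std::runtime_error("truncated utf8");
        // Payload bits of the lead byte, then 6 bits per continuation byte.
        std::uint32_t cp = len == 1 ? c : c & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        out += static_cast<char>(to_latin9(cp));
        i += len;
    }
    return out;
}
```

With this, the three utf8 bytes of the euro sign come out as the single
byte 0xA4, and the output has exactly one byte per character.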

 3. When necessary, I want to be able to get a std::wstring (a
    std::string of wide characters).  I want to be able to handle
    individual characters in the string, and I want to use standard
    C++.  Then this is the "correct" way of handling unicode strings.

that, too, is an issue the unicode library has to deal with:
I take it that with 'standard C++' you mean you want to be able to
access characters with the '[]' operator. That is a requirement for
the specific encoding you use. With utf8 characters don't have a fixed
size, so you don't have random access. Instead you have to iterate
over the string to find the nth character.

So, depending on what you want to do with the string, one encoding
may be better than another.
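
As a rough illustration of the iteration needed with utf8, a sketch
(utf8_offset_of is a name invented for this mail, and the input is
assumed to be well-formed; nothing is validated):

```cpp
#include <cstddef>
#include <string>

// Returns the byte offset of the n-th code point (0-based) in a utf8
// string, or std::string::npos if the string has fewer code points.
// Assumes 'utf8' is well-formed; no validation is attempted.
std::size_t utf8_offset_of(const std::string& utf8, std::size_t n)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new code point.
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) != 0x80) {
            if (seen == n)
                return i;
            ++seen;
        }
    }
    return std::string::npos;
}
```

The cost is linear in the position asked for, which is exactly what
'[]' on a fixed-width encoding avoids.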

Please note that unicode does not portably fit into std::wstring:
wchar_t may be as narrow as 16 bits (its width is implementation
defined), while unicode needs 21 bits per character. The basic 'plane'
fits into these 16 bits, but for lots of characters you need more, so
the encoding becomes variably sized (meaning, as explained above, there
is no random access).
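
To show where the variable size comes from with 16 bit units, here is a
small sketch of the utf16 encoding rule. The function name to_utf16 is
made up, and the argument is assumed to be a valid code point (at most
0x10FFFF and not itself in the surrogate range):

```cpp
#include <cstdint>
#include <vector>

// Encodes one unicode code point as utf16 code units. Code points
// below 0x10000 take one unit; everything above takes two
// (a 'surrogate pair').
std::vector<std::uint16_t> to_utf16(std::uint32_t cp)
{
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        const std::uint32_t v = cp - 0x10000;  // 20 bits remain
        // High surrogate carries the top 10 bits, low the bottom 10.
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
    }
    return units;
}
```

So the euro sign (U+20AC) is a single unit, while a character such as
U+1D11E needs two, and the nth unit is no longer the nth character.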

 4. Sometimes, it would probably be nice to get utf8 data as well.

yep.

The ordering of the points reflects their relative importance to me
personally, but I believe the following conclusions to hold regardless
of how the points are reordered:

Point one suggests that the library should use utf8 internally.  I
don't care much how this is done, but some kind of refptr<char*> seems
relatively sane while putting utf8 data in a std::string feels bad.

agreed.

Points 2 - 4 suggest that the get_ / set_ string methods should be
templated on string class and conversion method.  The alternative is
that I would have to call a converter at every call to a get_ / set_
string method, meaning in practice that I would have to write a
wrapper around libxml++ (which would feel ridiculous, since libxml++
itself is a wrapper around libxml).

agreed.
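
Just to sketch what such a templated getter might look like: everything
below is hypothetical. Node is a stand-in and not the real libxml++
class, the two policy structs are invented for this mail, and storing
the utf8 data in a std::string is only a simplification to keep the
example short.

```cpp
#include <string>

// Hypothetical stand-in for a libxml++ node; not the real class.
class Node {
public:
    explicit Node(const std::string& utf8_name) : name_(utf8_name) {}

    // Getter templated on a conversion policy. The policy decides both
    // the target string type and how utf8 is converted into it.
    template <typename Conv>
    typename Conv::string_type get_name() const
    {
        return Conv::from_utf8(name_);
    }

private:
    std::string name_;  // utf8 bytes (simplified storage for the sketch)
};

// Trivial policy: hand the utf8 bytes back unchanged.
struct Utf8Passthrough {
    typedef std::string string_type;
    static string_type from_utf8(const std::string& s) { return s; }
};

// Policy producing a std::wstring. Correct only for 7-bit input;
// a real policy would decode the utf8 sequences properly.
struct AsciiToWide {
    typedef std::wstring string_type;
    static string_type from_utf8(const std::string& s)
    {
        return string_type(s.begin(), s.end());
    }
};
```

A caller would then write node.get_name<AsciiToWide>() and get the
string type it wants, without a converter call at every use.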

Note that this _doesn't_ imply that the entire libxml++ would go from a
"dynamic library" to a "template library".  Just that some of the
methods would be templates.  Also, "common instantiations" of a
template method can - at least theoretically - be included in a
dynamic library.

well, most methods deal with strings. I originally tried to factor
out the string-type agnostic part into a base class, but that didn't
lead anywhere.

I agree that it would be possible to compile specific 'unicode bindings'
to deal with Murray's points about interface/implementation separation.
Whether that's actually worth the effort is another story.

Regards,
		Stefan




