Re: [xml] Libxml/Python/Unicode problem

Örjan Reinholdsen wrote:

What is going on here? Why is the python-binding trying to encode into ascii at all?
Or is it just me misunderstanding something here...

The libxml Python binding in fact does not accept Python unicode strings.

Libxml internally works with UTF-8 encoding, and its Python API also expects UTF-8.

The trick here is to convert your strings to UTF-8 before you pass them
into libxml, and to convert the strings you get back from it into unicode
strings again. Perhaps superfluously, here is how you do that:

utf8string = unicodestring.encode('UTF-8')

and back again:

unicodestring = unicode(utf8string, 'UTF-8')
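As a minimal sketch of that round trip, the two conversions can be wrapped in small helpers. The names to_utf8 and from_utf8 are made up for illustration and are not part of the libxml binding; data.decode('UTF-8') is used here as the equivalent of unicode(data, 'UTF-8').

```python
# Hypothetical helpers for illustration only; not part of the libxml binding.

def to_utf8(text):
    """Encode a unicode string to a UTF-8 byte string before passing it to libxml."""
    return text.encode('UTF-8')


def from_utf8(data):
    """Decode a UTF-8 byte string received from libxml back into a unicode string."""
    return data.decode('UTF-8')


name = u'\xd6rjan'                     # the unicode string "Örjan", written with an escape
utf8_name = to_utf8(name)              # UTF-8 byte string, safe to hand to libxml
round_tripped = from_utf8(utf8_name)   # unicode string again, equal to the original
```

Anything that goes into a libxml call would pass through to_utf8 first, and anything that comes out would pass through from_utf8 before the rest of the program sees it.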

I've discussed with Daniel in the past a global 'knob' for libxml that would make the Python binding accept and return unicode, at the extra cost of conversion in the binding layer (to UTF-8 and back). With this knob, *all* strings that enter the API would be treated as unicode strings (as if Python did a 'unicode()' on them). This means that the API would accept old-style strings in the ASCII range as well as unicode strings,
which is the default Python behavior when handling unicode.
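To make the idea concrete, here is a rough sketch of what such a knob might do to each incoming string. This is only an assumption about the behavior described above; coerce_for_libxml is a hypothetical name and nothing like it exists in the binding today.

```python
# Hypothetical sketch of the proposed 'knob'; not an existing libxml API.

def coerce_for_libxml(value):
    """Return the UTF-8 byte string libxml expects for one incoming string."""
    if isinstance(value, bytes):
        # Old-style byte string: treat it the way a plain unicode(value)
        # would, i.e. decode it as ASCII. Bytes outside the ASCII range
        # raise UnicodeDecodeError instead of being guessed at.
        value = value.decode('ascii')
    # At this point value is a unicode string; encode it for libxml.
    return value.encode('UTF-8')
```

With this, ASCII-only old-style strings and unicode strings would both be accepted, while a byte string containing non-ASCII data would be rejected rather than silently misinterpreted.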

It may not be a bad idea to make some progress on this, as Python programmers trying to do the right thing with unicode in the Python sense now get blocked by confusion in libxml, even though libxml does the right thing too in its own terms. :)


