Re: [xml] Perl module XML::LibXML not encoding UTF-8 properly [SOLUTION]



On Mon, Sep 26, 2005 at 10:57:50AM -0700, Loren Osborn wrote:
Perl 5.6 and above use UTF-8 for all strings internally unless
explicitly told not to.  With that knowledge, it seemed natural to pass
the already UTF-8 encoded string to the XML::LibXML library.
Unfortunately, libxml2 (and by extension XML::LibXML) treated the string
I passed it as a byte stream, and not a character (or code-point)
stream.

  I am sure libxml2 expects UTF-8 in the tree APIs. I have no idea
if XML::LibXML does any conversion on top, and if so, why it does it
or to/from what. I would expect the XML::LibXML documentation to cover
this.

UTF-8 makes certain assertions about how multi-byte characters are
represented.  While this code change doesn't check all of those
assertions, it does ensure that all the non-first bytes have their
high bits set correctly.  This is likely to catch similar errors, at
least regarding Latin characters.  If you are feeling ambitious, feel
free to check for the assertion that code-points are encoded in the
fewest number of bytes possible.  This patch is untested, and I would
prefer that a developer more familiar with the libxml2 library give it
a more thorough once-over.
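
For illustration only, the check described above might look roughly like
the sketch below. This is not the patch from this thread and not libxml2
code: utf8_is_well_formed is a hypothetical helper name, and the sketch
assumes the caller has a byte buffer and its length. It verifies that
every continuation byte has the form 10xxxxxx, and also rejects overlong
encodings (the "fewest number of bytes" rule), surrogates, and values
above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of libxml2): returns 1 if the len bytes
 * at s are structurally valid UTF-8, 0 otherwise. */
static int utf8_is_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char lead = s[i];
        size_t extra;       /* number of continuation bytes expected */
        unsigned long cp;   /* code point decoded so far */
        unsigned long min;  /* smallest code point legal for this length */

        if (lead < 0x80) {
            extra = 0; cp = lead; min = 0;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; min = 0x10000;
        } else {
            return 0;       /* stray continuation byte or invalid lead */
        }

        if (extra > len - i - 1)
            return 0;       /* sequence is truncated */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80)
                return 0;   /* non-first byte lacks the 10xxxxxx pattern */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min)
            return 0;       /* overlong: not the shortest encoding */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;       /* outside Unicode range, or a surrogate */

        i += extra + 1;
    }
    return 1;
}
```

With this sketch, a lone Latin-1 byte such as 0xE9 ("é" in ISO-8859-1) is
rejected, while the two-byte UTF-8 form 0xC3 0xA9 is accepted. The cost is
one pass over every input string, which is the runtime concern raised in
the reply.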

  The problem is that you add this check in only one API. I am not sure
it makes sense to do this on one entry point and not all the others.
I am also not sure it makes sense to add the checking to all tree APIs;
this could be extremely costly at runtime.

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


