RE: [xml] Perl module XML::LibXML not encoding UTF-8 properly [SOLUTION]



I appreciate your feedback, but unfortunately it didn't give me any
additional warnings or errors.  Fortunately I *DID* figure out both the
cause and a solution.  Additionally I'd like to propose a code change to
catch most instances of this problem in the future.

First, the cause:

Perl 5.6 and above use UTF-8 for all strings internally unless
explicitly told not to.  With that knowledge, it seemed natural to pass
the already UTF-8 encoded string to the XML::LibXML library.
Unfortunately, libxml2 (and by extension XML::LibXML) treated the string
I passed it as a byte stream, and not a character (or code-point)
stream.

Second, the solution:

As "ISO-8859-1" is identical to Unicode for code-points from 0x00 to
0xFF, you can name "ISO-8859-1" as the source encoding to UTF-8 encode
the already UTF-8 encoded string a second time, so that when libxml2
treats the resulting code-points as bytes *IN* a UTF-8 stream, it
produces the proper result.

So instead of creating the text node with:

        XML::LibXML::Text->new($sValue)

use this instead:

        XML::LibXML::Text->new(encodeToUTF8("ISO-8859-1", $sValue))

This corrects the problem.  I suspect you also need to call
decodeFromUTF8() when retrieving values from XML::LibXML.  Kudos to the
team at RackSpace for assisting me in finding this solution.
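
To see what an ISO-8859-1 to UTF-8 conversion does at the byte level,
here is a minimal sketch in C.  It assumes nothing about how
encodeToUTF8() is implemented; latin1_to_utf8 is a hypothetical helper
written only for illustration and is not part of libxml2 or
XML::LibXML.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a libxml2 or XML::LibXML function): treat
 * each input byte as an ISO-8859-1 code-point and write it as UTF-8.
 * "out" must have room for 2 * len + 1 bytes.  Returns the number of
 * bytes written, not counting the terminating NUL.
 */
size_t latin1_to_utf8(const unsigned char *in, size_t len,
                      unsigned char *out)
{
    size_t n = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char c = in[i];

        if (c < 0x80) {
            out[n++] = c;                  /* ASCII: one byte, unchanged */
        } else {
            out[n++] = 0xC0 | (c >> 6);    /* lead byte:    110000xx */
            out[n++] = 0x80 | (c & 0x3F);  /* continuation: 10xxxxxx */
        }
    }
    out[n] = '\0';
    return n;
}
```

Given the single ISO-8859-1 byte 0xE9 (e-acute), this writes the two
bytes 0xC3 0xA9.  Given bytes that are already UTF-8 (0xC3 0xA9), it
writes 0xC3 0x83 0xC2 0xA9, which is the double encoding described
above.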

The code change:

UTF-8 makes certain assertions about how multi-byte characters are
represented.  This code change doesn't check all of those assertions,
but it does ensure that every non-first (continuation) byte has its
high bits set correctly, i.e. matches 10xxxxxx.  This is likely to
catch similar errors, at least for Latin characters.  If you are
feeling ambitious, feel free to also check the assertion that
code-points are encoded in the fewest number of bytes possible.  This
patch is untested, so I would prefer that a developer more familiar
with the libxml2 library give it a more thorough once-over.

The following is a patch against what I just got out of CVS:

Index: entities.c
===================================================================
RCS file: /cvs/gnome/libxml2/entities.c,v
retrieving revision 1.84
diff -u -r1.84 entities.c
--- entities.c  1 Apr 2005 13:11:51 -0000       1.84
+++ entities.c  26 Sep 2005 17:41:21 -0000
@@ -568,7 +568,22 @@
                char buf[11], *ptr;
                int val = 0, l = 1;

-               if (*cur < 0xC0) {
+               if (
+                    (*cur < 0xC0) ||
+                    (
+                        ((cur[0] & 0xE0) == 0xC0) &&
+                        ((cur[1] & 0xC0) != 0x80)
+                    ) || (
+                        ((cur[0] & 0xF0) == 0xE0) &&
+                        (   ((cur[1] & 0xC0) != 0x80) ||
+                            ((cur[2] & 0xC0) != 0x80))
+                    ) || (
+                        ((cur[0] & 0xF8) == 0xF0) &&
+                        (   ((cur[1] & 0xC0) != 0x80) ||
+                            ((cur[2] & 0xC0) != 0x80) ||
+                            ((cur[3] & 0xC0) != 0x80))
+                    )
+                ) {
                    xmlEntitiesErr(XML_CHECK_NOT_UTF8,
                            "xmlEncodeEntitiesReentrant : input not UTF-8");
                    if (doc != NULL)
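
For anyone who wants to exercise the condition outside of libxml2, the
following is a rough standalone sketch of the same test, plus one
possible form of the "fewest bytes" check mentioned above.  Both
function names are hypothetical and neither exists in libxml2.  The
sketch assumes "cur" points into a NUL-terminated buffer and that
cur[0] is already known to be >= 0x80.

```c
/*
 * Hypothetical standalone form of the condition added by the patch.
 * Returns 1 if the sequence starting at cur should be rejected, else 0.
 * Each continuation byte is read only if the one before it passed, so
 * a NUL terminator stops the check before it runs off the buffer.
 */
int utf8_seq_malformed(const unsigned char *cur)
{
    if (cur[0] < 0xC0)
        return 1;   /* continuation byte where a lead byte belongs */
    if ((cur[0] & 0xE0) == 0xC0)            /* 2-byte lead: 110xxxxx */
        return (cur[1] & 0xC0) != 0x80;
    if ((cur[0] & 0xF0) == 0xE0)            /* 3-byte lead: 1110xxxx */
        return (cur[1] & 0xC0) != 0x80 ||
               (cur[2] & 0xC0) != 0x80;
    if ((cur[0] & 0xF8) == 0xF0)            /* 4-byte lead: 11110xxx */
        return (cur[1] & 0xC0) != 0x80 ||
               (cur[2] & 0xC0) != 0x80 ||
               (cur[3] & 0xC0) != 0x80;
    return 0;       /* 0xF8 and above: not examined, as in the patch */
}

/*
 * Hypothetical "fewest bytes" check, NOT part of the patch.  Call it
 * only after utf8_seq_malformed() has returned 0.  Returns 1 if the
 * code-point could have been encoded in fewer bytes.
 */
int utf8_seq_overlong(const unsigned char *cur)
{
    if (cur[0] == 0xC0 || cur[0] == 0xC1)
        return 1;   /* 2 bytes used for a code-point below 0x80 */
    if (cur[0] == 0xE0 && cur[1] < 0xA0)
        return 1;   /* 3 bytes used for a code-point below 0x800 */
    if (cur[0] == 0xF0 && cur[1] < 0x90)
        return 1;   /* 4 bytes used for a code-point below 0x10000 */
    return 0;
}
```

With this, an ISO-8859-1 e-acute followed by an ASCII letter (0xE9
0x74) is rejected, because 0xE9 looks like a 3-byte lead but 0x74 is
not a continuation byte, while the proper UTF-8 form (0xC3 0xA9) is
accepted.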




Bruce Miller wrote:
Ah, there's your first problems...
missing -w, and use strict.
Beyond that, I can't help you...



