Re: [xml] Perl module XML::LibXML not encoding UTF-8 properly [SOLUTION]

From: Petr Pajas <pajas ufal mff cuni cz>
To: xml gnome org
Cc:
Subject: Re: [xml] Perl module XML::LibXML not encoding UTF-8 properly [SOLUTION]
Date: Mon, 26 Sep 2005 22:41:57 +0200

Hi Loren.

On Mon Sep 26 2005 19:57, Loren Osborn wrote:

I appreciate your feedback, but unfortunately it didn't give me any
additional warnings or errors.  Fortunately I *DID* figure out both the
cause and a solution.  Additionally I'd like to propose a code change to
catch most instances of this problem in the future.

First, the cause:

Perl 5.6 and above use UTF-8 for all strings internally unless
explicitly told not to.


Not true. Perl 5.6 and above use UTF-8 only for strings that come into
memory in a UTF-8 enabled way (e.g. as literal strings from a source file
when the utf8 pragma is in effect, or by reading a filehandle with the
appropriate PerlIO layer, namely :encoding(...) or :utf8). Otherwise, they
are treated as bytes, i.e. byte semantics are still the default in Perl.
You can check whether the UTF8 flag is on with, e.g., Devel::Peek's Dump
or Encode::is_utf8 from the Encode module. See 'perldoc perlunicode' for
more info.

With that knowledge, it seemed natural to pass the already UTF-8 encoded
string to the XML::LibXML library. Unfortunately, libxml2 (and by
extension XML::LibXML) treated the string I passed it as a byte stream,
and not a character (or code-point) stream.


Because it was a byte stream and not a character stream (i.e. the UTF8 flag 
was off on the particular scalar value). If your scalar had the UTF8 flag on, 
then XML::LibXML would treat it correctly as UTF8 and so would libxml2.

Secondly, the solution:

As "ISO-8859-1" is identical to Unicode for code-points from 0x00 to
0xFF, you can use "ISO-8859-1" to double-UTF-8 encode the UTF-8 encoded
string, so that when libxml2 treats the resulting code-points as bytes
*IN* a UTF-8 stream, it produces the proper result.

    So instead of creating the text node with:

        XML::LibXML::Text->new($sValue)

    Use this instead:

        XML::LibXML::Text->new(encodeToUTF8("ISO-8859-1", $sValue))


So you see, your scalar wasn't in fact internally represented as UTF8, but as 
bytes. That's why this helps.
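
To make the byte-level effect concrete, here is a minimal C sketch of what
an ISO-8859-1 to UTF-8 conversion does to a byte string. It is only an
illustration, not the code XML::LibXML or libxml2 actually runs; the
function name latin1_to_utf8 and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libxml2 or XML::LibXML.
 * Treats each input byte as an ISO-8859-1 code point (0x00-0xFF) and
 * writes its UTF-8 encoding to out. The caller must provide room for
 * 2 * len bytes in out. Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len,
                      unsigned char *out)
{
    size_t i, n = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];                  /* ASCII: copied unchanged */
        } else {
            out[n++] = 0xC0 | (in[i] >> 6);    /* lead byte: 110000xx */
            out[n++] = 0x80 | (in[i] & 0x3F);  /* continuation: 10xxxxxx */
        }
    }
    return n;
}
```

For example, the single byte 0xE9 (e-acute in ISO-8859-1) comes out as the
two bytes 0xC3 0xA9. Applied to bytes that are already UTF-8, the same
conversion would encode them a second time.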


    This corrects the problem. And I suspect you need to
decodeFromUTF8() when retrieving values from XML::LibXML.


A much better way (at least with perl 5.8.x) is simply to make sure you
use character semantics in Perl. That way, you can forget about encodings
and it just works. In more detail:

- if the strings in your scripts are iso-8859-1, put 

use encoding 'iso-8859-1';

at the beginning of the script. If they are UTF-8, try

use utf8;

- if you want some specific encoding on the input and/or output, use one of 
the 'use open' variants (see 'perldoc open' for details).

Kudos to the team at RackSpace for assisting me in finding this solution.


Or you could have just read the documentation of XML::LibXML in more detail
(specifically XML::LibXML::DOM) and/or searched the archives of the perl-xml
mailing list (perl-xml listserv activestate com), which, by the way, is a
much better audience for discussing Perl XML modules, including XML::LibXML.

Regards,

-- Petr

The code change:

UTF-8 makes certain assertions about how multi-byte characters are
represented.  This code change doesn't check all of those assertions,
but it does ensure that all the non-first bytes have their high bits set
correctly.  This is likely to catch similar errors, at least regarding
Latin characters.  If you are feeling ambitious, feel free to also check
that code-points are encoded in the fewest number of bytes possible.
This patch is untested, so I would prefer that a developer more familiar
with the libxml2 library give it a more thorough once over.

The following is a patch against what I just got out of CVS:

Index: entities.c
===================================================================
RCS file: /cvs/gnome/libxml2/entities.c,v
retrieving revision 1.84
diff -u -r1.84 entities.c
--- entities.c  1 Apr 2005 13:11:51 -0000       1.84
+++ entities.c  26 Sep 2005 17:41:21 -0000
@@ -568,7 +568,22 @@
                char buf[11], *ptr;
                int val = 0, l = 1;

-               if (*cur < 0xC0) {
+               if (
+                    (*cur < 0xC0) ||
+                    (
+                        ((cur[0] & 0xE0) == 0xC0) &&
+                        ((cur[1] & 0xC0) != 0x80)
+                    ) || (
+                        ((cur[0] & 0xF0) == 0xE0) &&
+                        (   ((cur[1] & 0xC0) != 0x80) ||
+                            ((cur[2] & 0xC0) != 0x80))
+                    ) || (
+                        ((cur[0] & 0xF8) == 0xF0) &&
+                        (   ((cur[1] & 0xC0) != 0x80) ||
+                            ((cur[2] & 0xC0) != 0x80) ||
+                            ((cur[3] & 0xC0) != 0x80))
+                    )
+                ) {
                    xmlEntitiesErr(XML_CHECK_NOT_UTF8,
                            "xmlEncodeEntitiesReentrant : input not UTF-8");
                    if (doc != NULL)
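
The checks in the patch can also be written as a standalone function. The
following is a rough sketch of that idea, extended with the shortest-form
check suggested above and with an explicit length bound. It is not libxml2
code; the name utf8_seq_len and its parameters are hypothetical, and it
has not been tried against entities.c.

```c
#include <stddef.h>

/* Hypothetical helper, not libxml2 API.
 * Returns the length in bytes (1-4) of the well-formed UTF-8 sequence
 * starting at s, or 0 if the sequence is malformed. len is the number
 * of bytes available at s. */
size_t utf8_seq_len(const unsigned char *s, size_t len)
{
    size_t n, i;
    unsigned long cp;

    if (len == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                          /* plain ASCII */

    /* The lead byte determines the expected sequence length. */
    if ((s[0] & 0xE0) == 0xC0)      { n = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; cp = s[0] & 0x07; }
    else
        return 0;       /* stray continuation byte or invalid lead byte */

    if (len < n)
        return 0;                          /* truncated sequence */

    /* Every non-first byte must look like 10xxxxxx. */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject encodings longer than the shortest possible form. */
    if ((n == 2 && cp < 0x80) ||
        (n == 3 && cp < 0x800) ||
        (n == 4 && cp < 0x10000))
        return 0;

    /* Reject UTF-16 surrogates and values beyond U+10FFFF. */
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;

    return n;
}
```

A caller would walk the buffer by advancing the returned number of bytes
and treating a return value of 0 as "input not UTF-8".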




Bruce Miller wrote:
Ah, there's your first problems...
missing -w, and use strict.
Beyond that, I can't help you...



References:
- RE: [xml] Perl module XML::LibXML not encoding UTF-8 properly [SOLUTION]
  - From: Loren Osborn
