Hi Loren. On Mon Sep 26 2005 19:57, Loren Osborn wrote:
I appreciate your feedback, but unfortunately it didn't give me any additional warnings or errors. Fortunately I *DID* figure out both the cause and a solution. Additionally I'd like to propose a code change to catch most instances of this problem in the future. First, the cause: Perl 5.6 and above use UTF-8 for all strings internally unless explicitly told not to.
Not true. Perl 5.6 and above use UTF-8 internally only for strings that come into memory in a UTF-8-aware way (e.g. as literal strings in source code when the 'use utf8' pragma is in effect, or by reading a filehandle through the appropriate PerlIO layer, namely :encoding(...) or :utf8). Otherwise they are treated as bytes, i.e. byte semantics is still the default in Perl. You can check whether the UTF8 flag is on, e.g. using Devel::Peek's Dump or via the Encode module. See 'perldoc perlunicode' for more info.
With that knowledge, it seemed natural to pass the already UTF-8 encoded string to the XML::LibXML library. Unfortunately, libxml2 (and by extension XML::LibXML) treated the string I passed it as a byte stream, and not a character (or code-point) stream.
Because it was a byte stream and not a character stream (i.e. the UTF8 flag was off on the particular scalar value). If your scalar had the UTF8 flag on, then XML::LibXML would treat it correctly as UTF8 and so would libxml2.
Secondly, the solution: As "ISO-8859-1" is identical to Unicode for code-points from 0x00 to 0xFF, you can use "ISO-8859-1" to double-UTF-8 encode the UTF-8 encoded string, so that when libxml2 treats the resulting code-points as bytes *IN* a UTF-8 stream, it produces the proper result. So instead of creating the text node with:

    XML::LibXML::Text->new($sValue)

Use this instead:

    XML::LibXML::Text->new(encodeToUTF8("ISO-8859-1", $sValue))
So you see, your scalar wasn't in fact internally represented as UTF8, but as bytes. That's why this helps.
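To illustrate at the byte level what a conversion from ISO-8859-1 to UTF-8 does, here is a minimal C sketch. The helper name latin1_to_utf8 is made up for this example; it is not the code behind encodeToUTF8, only the standard mapping in which every byte at or above 0x80 becomes a two-byte sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only: convert n ISO-8859-1 bytes
 * to UTF-8. The caller must provide room for 2 * n bytes in out.
 * Returns the number of bytes written. */
static size_t latin1_to_utf8(const unsigned char *in, size_t n,
                             unsigned char *out)
{
    size_t i, j = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            /* ASCII is identical in both encodings */
            out[j++] = in[i];
        } else {
            /* code points 0x80..0xFF need a lead byte and one continuation byte */
            out[j++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[j++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return j;
}
```

Given the Latin-1 bytes 63 61 66 E9 ("café"), this produces 63 61 66 C3 A9. If the input bytes were already UTF-8 (C3 A9), the same conversion would yield C3 83 C2 A9, i.e. a double-encoded string, which is why it matters what the scalar really holds.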
This corrects the problem. And I suspect you need to decodeFromUTF8() when retrieving values from XML::LibXML.
A much better way (at least with perl 5.8.x) is just to make sure you use character semantics in perl. That way, you can forget about encodings and it just works. In more detail:

- if the strings in your scripts are iso-8859-1, put
      use encoding 'iso-8859-1';
  at the beginning of the script. If they are UTF-8, try
      use utf8;
- if you want some specific encoding on the input and/or output, use one of the 'use open' variants (see 'perldoc open' for details).
Kudos to the team at RackSpace for assisting me in finding this solution.
Or you could have just read the documentation of XML::LibXML in more detail (specifically XML::LibXML::DOM) and/or searched the archives of the perl-xml mailing list (perl-xml listserv activestate com), which, by the way, is a much better audience for discussing perl XML modules, including XML::LibXML.

Regards,
-- Petr
The code change: UTF-8 makes certain assertions about how multi-byte characters are represented. This code change doesn't check all of those assumptions, but it does ensure that all the non-first bytes have their high bits set correctly. This is likely to catch similar errors at least regarding Latin characters. If you are feeling ambitious, feel free to check for the assertion that code-points are encoded in the fewest number of bytes possible. This patch is untested, but I prefer that a developer more familiar with the libxml2 library give it a more thorough once over. The following is a patch against what I just got out of CVS:

Index: entities.c
===================================================================
RCS file: /cvs/gnome/libxml2/entities.c,v
retrieving revision 1.84
diff -u -r1.84 entities.c
--- entities.c  1 Apr 2005 13:11:51 -0000  1.84
+++ entities.c  26 Sep 2005 17:41:21 -0000
@@ -568,7 +568,22 @@
         char buf[11], *ptr;
         int val = 0, l = 1;
 
-        if (*cur < 0xC0) {
+        if (
+            (*cur < 0xC0) ||
+            (
+                ((cur[0] & 0xE0) == 0xC0) &&
+                ((cur[1] & 0xC0) != 0x80)
+            ) || (
+                ((cur[0] & 0xF0) == 0xE0) &&
+                ( ((cur[1] & 0xC0) != 0x80) ||
+                  ((cur[2] & 0xC0) != 0x80))
+            ) || (
+                ((cur[0] & 0xF8) == 0xF0) &&
+                ( ((cur[1] & 0xC0) != 0x80) ||
+                  ((cur[2] & 0xC0) != 0x80) ||
+                  ((cur[3] & 0xC0) != 0x80))
+            )
+        ) {
             xmlEntitiesErr(XML_CHECK_NOT_UTF8,
                     "xmlEncodeEntitiesReentrant : input not UTF-8");
             if (doc != NULL)

Bruce Miller wrote:
Ah, there's your first problems... missing -w, and use strict. Beyond that, I can't help you...

_______________________________________________
xml mailing list, project page http://xmlsoft.org/
xml gnome org
http://mail.gnome.org/mailman/listinfo/xml
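The continuation-byte test in the quoted patch can also be tried in isolation. The following is a minimal standalone sketch, assuming a NUL-terminated buffer; the function name utf8_seq_len is hypothetical and not part of libxml2. Like the patch, it only checks that every non-first byte has the form 10xxxxxx and does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not libxml2 API): return the length in bytes (1..4)
 * of the UTF-8 sequence starting at cur, or 0 if the lead byte or one of
 * its continuation bytes is malformed. cur must be NUL-terminated. */
static int utf8_seq_len(const unsigned char *cur)
{
    int len, i;

    assert(cur != NULL);
    if (cur[0] < 0x80)
        return 1;                        /* plain ASCII (including NUL) */
    else if ((cur[0] & 0xE0) == 0xC0)
        len = 2;                         /* 110xxxxx: one continuation byte */
    else if ((cur[0] & 0xF0) == 0xE0)
        len = 3;                         /* 1110xxxx: two continuation bytes */
    else if ((cur[0] & 0xF8) == 0xF0)
        len = 4;                         /* 11110xxx: three continuation bytes */
    else
        return 0;                        /* stray 0x80..0xBF, or 0xF8..0xFF */

    for (i = 1; i < len; i++) {
        /* A NUL terminator fails this test too, so the loop never reads
         * past the end of the string. */
        if ((cur[i] & 0xC0) != 0x80)
            return 0;
    }
    return len;
}
```

For example, the UTF-8 bytes C3 A9 give 2, while a lone Latin-1 E9 followed by an ASCII letter gives 0, which is the kind of input the patch is meant to flag.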