RE: [xml] Perl module XML::LibXML not encoding UTF-8 properly In[SOLUTION]

From: "Loren Osborn" <lsosborn dis-sol-inc com>
To: <xml gnome org>
Subject: RE: [xml] Perl module XML::LibXML not encoding UTF-8 properly In[SOLUTION]
Date: Mon, 26 Sep 2005 14:07:01 -0700

Bruce R Miller wrote:

> Loren Osborn wrote:
> > I appreciate your feedback, but unfortunately it didn't give me any
> > additional warnings or errors.
>
> First off, thanks for taking my ill-tempered "feedback" with good humor...

I always try to have my "grain of salt" handy, but did want to acknowledge your taking the time to reply. It is so often easier to ignore than reply, and I was grateful to get *A* response, even if it wasn't helpful.

> Running under 5.6, your original code produced a
> "output error : invalid character value"
> message. I seem to recall problems in 5.6 with \x for the 2 digit case,
> although codepoints higher than FF work.
> In particular: \xED and \x{00ED} fail,
> but \N{LATIN SMALL LETTER I WITH ACUTE} works
> (with use charnames qw(:full); ).
>
> OTOH, under 5.8, your original encoding as \xED is apparently
> read correctly. However the output simply outputs the unicode
> character, rather than the character entity í you were expecting.

Yes, I must admit that I only had 5.8 at my disposal, and referred to 5.6 only on the basis of what I read online. I was concerned only with the string once it already existed in Perl.

In my specific situation, though, what I ended up with was a bogus Unicode character:

I started with the 3 character string:

"ía "

As Unicode code points this is:

0xED 0x61 0x20

So, as UTF-8, within Perl this should have been stored as:

0xC3 0xAD 0x61 0x20
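The difference between the code points and their UTF-8 bytes can be checked with a small sketch. It is written in Python purely to show the byte arithmetic; it is not the Perl code from this thread, and the variable names are illustrative only.

```python
# The three-character string from the message: i-acute, "a", space.
text = "\u00eda "

# Code points, one per character.
code_points = [hex(ord(ch)) for ch in text]

# UTF-8 bytes: U+00ED needs two bytes, the two ASCII characters one each.
utf8_bytes = text.encode("utf-8")

print(code_points)          # ['0xed', '0x61', '0x20']
print(utf8_bytes.hex(" "))  # c3 ad 61 20
```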

So what I expected as output was:

"ía "

except that the Unicode code-points:

0xED 0x61 0x20

were interpreted as the UTF-8 bytes:

0xED 0xA1 0xA0

Which produced the illegal Unicode character:

"&#xD860;" (absorbing the “a” and the space)
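One way to reproduce these numbers is to assume that the code-point values were read as raw bytes by a decoder that treated 0xED as the lead byte of a three-byte sequence and did not validate the two bytes that followed. That is an assumption; I did not establish which layer actually did this. The Python sketch below only illustrates the arithmetic, and `lenient_decode3` is a hypothetical helper, not a libxml2 or Perl function. The result, 0xD860, falls in the surrogate range (0xD800 to 0xDFFF), which is why it is not a legal character.

```python
def lenient_decode3(b0, b1, b2):
    """Decode a 3-byte UTF-8 sequence WITHOUT checking that b1 and b2
    are real continuation bytes (hypothetical lenient decoder)."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

# Code points 0xED 0x61 0x20 handed over as if they were UTF-8 bytes.
bogus = lenient_decode3(0xED, 0x61, 0x20)
print(hex(bogus))  # 0xd860

# The UTF-8 form of that value; "surrogatepass" is needed because
# Python normally refuses to encode surrogates.
reencoded = chr(bogus).encode("utf-8", "surrogatepass")
print(reencoded.hex(" "))  # ed a1 a0

# A strict decoder rejects the same three bytes outright.
try:
    bytes([0xED, 0x61, 0x20]).decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False
```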

Now the solution was to encode the UTF-8 bytes as if they were code-points:

0xED 0x61 0x20

becomes:

0xC3 0x83 0xC2 0xAD 0x61 0x20

internally, within Perl, which libxml2 now reads correctly as the byte stream:

0xC3 0xAD 0x61 0x20

which it interprets as the Unicode string:

0xED 0x61 0x20

and produces the correct output:

"ía "
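The round trip described above can be sketched as follows. Again this is Python used only to show the bytes, not the Perl fix itself, and the variable names are illustrative. Latin-1 is used because it maps each byte value 0x00 to 0xFF to the code point with the same number, which is what "encoding the bytes as if they were code-points" amounts to.

```python
text = "\u00eda "

# Step 1: the proper UTF-8 bytes of the string.
utf8_bytes = text.encode("utf-8")                # c3 ad 61 20

# Step 2: pretend each byte is a code point of the same value.
as_code_points = utf8_bytes.decode("latin-1")    # U+00C3 U+00AD U+0061 U+0020

# Step 3: what an internal UTF-8 store of that string holds.
double_encoded = as_code_points.encode("utf-8")  # c3 83 c2 ad 61 20

# A consumer that takes the code points as bytes and decodes them
# as UTF-8 gets the original string back.
recovered = as_code_points.encode("latin-1").decode("utf-8")

print(double_encoded.hex(" "))  # c3 83 c2 ad 61 20
print(recovered == text)        # True
```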

I hope that is now less confusing.

Thanks again for your comments and feedback,

-Loren
