RE: [xml] Perl module XML::LibXML not encoding UTF-8 properly In[SOLUTION]



Bruce R Miller wrote:
> Loren Osborn wrote:
> > I appreciate your feedback, but unfortunately it didn't give me any
> > additional warnings or errors.
>
> First off, thanks for taking my ill-tempered "feedback" with good humor...

 

I always try to have my "grain of salt" handy, but I did want to acknowledge your taking the time to reply.  It is so often easier to ignore than to reply, and I was grateful to get *A* response, even if it wasn't helpful.

 

> Running under 5.6, your original code produced a
> "output error : invalid character value"
> message.  I seem to recall problems in 5.6 with \x for the 2 digit case,
> although codepoints higher than FF work.
> In particular: \xED and \x{00ED} fail,
> but \N{LATIN SMALL LETTER I WITH ACUTE} works
> (with use charnames qw(:full); ).
>
> OTOH, under 5.8, your original encoding as \xED is apparently
> read correctly.  However the output simply outputs the unicode
> character, rather than the character entity for í you were expecting.

 

Yes, I must admit that I only had 5.8 at my disposal, and my reference to 5.6 was based only on what I read online.  I was only concerned with the string once it already existed in Perl.

 

In my specific situation, though, what I ended up with was a bogus Unicode character:

 

I started with the 3 character string:

"ía "

As Unicode code-points this is:

      0xED  0x61  0x20

So, as UTF-8, within Perl this should have been stored:

      0xC3  0xAD  0x61  0x20

So what I expected as output was:

      "ía "

except that the Unicode code-points:

      0xED  0x61  0x20

were interpreted as the UTF-8 bytes:

      0xED  0xA1  0xA0

Which produced the illegal Unicode character:

      "�"  (absorbing the “a” and the space)
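The byte values above can be checked with a little arithmetic. The sketch below assumes a lenient decoder that sees 0xED (the lead byte of a three-byte UTF-8 sequence) and folds the next two values in without checking that they are real continuation bytes. That leniency, and the helper name lenient_decode3, are assumptions for illustration only, not a description of libxml2 internals.

```perl
use strict;
use warnings;

# Combine three values as a UTF-8 three-byte sequence WITHOUT checking that
# the second and third are genuine continuation bytes (10xxxxxx).
# Hypothetical helper; the lenient behaviour is assumed for illustration.
sub lenient_decode3 {
    my ($b1, $b2, $b3) = @_;
    return (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F);
}

my $from_raw   = lenient_decode3(0xED, 0x61, 0x20);  # the raw code-point values
my $from_bytes = lenient_decode3(0xED, 0xA1, 0xA0);  # the bytes quoted above

printf "U+%04X U+%04X\n", $from_raw, $from_bytes;    # prints "U+D860 U+D860"

# D800..DFFF is reserved for UTF-16 surrogates, so it is not a legal character.
my $is_surrogate = ($from_raw >= 0xD800 && $from_raw <= 0xDFFF) ? 1 : 0;
print "surrogate: $is_surrogate\n";                  # prints "surrogate: 1"
```

Both inputs land on U+D860, which sits in the surrogate range. That is consistent with the "a" and the space being absorbed into a single illegal character.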

 

 

Now the solution was to encode the string as UTF-8 twice, i.e. to take the UTF-8 bytes (0xC3 0xAD 0x61 0x20) and encode them again as if they were code-points, so that:

      0xED  0x61  0x20

becomes:

      0xC3  0x83  0xC2  0xAD  0x61  0x20

internally, within Perl, which libxml2 now reads correctly as the byte stream:

      0xC3  0xAD  0x61  0x20

which it interprets as the Unicode string:

      0xED  0x61  0x20

and produces the correct output:

      "ía "
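The byte sequences in this walk-through can be reproduced with the Encode module that ships with Perl 5.8. This is a minimal sketch of the double encoding only. It is not necessarily the code used for the actual fix, and the helper hex_dump is a hypothetical name introduced here.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Show each character (or byte) of a string as a hex value.
sub hex_dump {
    my ($s) = @_;
    return join(' ', map { sprintf('0x%02X', ord($_)) } split(//, $s));
}

my $chars = "\x{ED}a ";               # code-points 0xED 0x61 0x20
my $once  = encode('UTF-8', $chars);  # ordinary UTF-8 bytes
my $twice = encode('UTF-8', $once);   # bytes treated as code-points, encoded again
my $back  = decode('UTF-8', $twice);  # one decode undoes one layer

print hex_dump($chars), "\n";  # 0xED 0x61 0x20
print hex_dump($once),  "\n";  # 0xC3 0xAD 0x61 0x20
print hex_dump($twice), "\n";  # 0xC3 0x83 0xC2 0xAD 0x61 0x20
print hex_dump($back),  "\n";  # 0xC3 0xAD 0x61 0x20
```

A consumer that decodes the doubly encoded string once ends up holding the ordinary UTF-8 bytes, which matches the byte stream described above.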

 

I hope that is now less confusing.

 

 

Thanks again for your comments and feedback,

 

-Loren


