I gotta admit, what is throwing me for a loop though is that
when I dump the variable directly to standard output, I get the two bytes
corresponding to the correct UTF-8 encoding. It is possible that the print function does
this conversion on the fly for strings that are not UTF-8 encoded, but it does make the
problem rather hard to characterize:

[lsosborn devbox dsi]$ ~/bin/LibXML+UTF8_test | grep '<'
	Una manzana al día mantiene al doctor ausente. & blah <blah>
[lsosborn devbox dsi]$ ~/bin/LibXML+UTF8_test | grep '<' | od -t x1
0000000 09 55 6e 61 20 6d 61 6e 7a 61 6e 61 20 61 6c 20
0000020 64 c3 ad 61 20 6d 61 6e 74 69 65 6e 65 20 61 6c
0000040 20 64 6f 63 74 6f 72 20 61 75 73 65 6e 74 65 2e
0000060 20 26 20 62 6c 61 68 20 3c 62 6c 61 68 3e 20 0a
0000100
[lsosborn devbox dsi]$

Bruce Miller wrote:

Petr's
message hits the core point: the string is still bytes, rather than Unicode
characters. The particular problem here is that the construct \x{...} only forces
conversion to Unicode if the argument is 0x100 or greater (for compatibility
with old scripts). To force the conversion, you need to use either the
\N{name} construct or pack('U', $code).
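
For what it's worth, here is a minimal sketch of the behaviour Bruce describes, so it can be checked on a given perl. The helper names (is_char_string, hex_bytes) and the choice of U+00ED (the "í" in "día") are my own for illustration and are not taken from the test script above. It only inspects perl's internal UTF-8 flag with utf8::is_utf8 and then encodes one of the strings with Encode to show the bytes.

```perl
use strict;
use warnings;
use charnames ':full';    # lets older perls resolve \N{name}
use Encode qw(encode);

# Hypothetical helper: reports whether perl stores the string as characters
# (internal UTF-8 flag on) or as plain bytes (flag off).
sub is_char_string {
    my ($str) = @_;
    return utf8::is_utf8($str) ? 1 : 0;
}

# Hypothetical helper: renders a byte string as space-separated hex,
# in the same style as the od -t x1 output.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $code = 0xED;    # U+00ED, assumed here as the character of interest

my @cases = (
    [ '\x{ed}',      "\x{ed}" ],                              # below 0x100
    [ '\x{100}',     "\x{100}" ],                             # 0x100 and up
    [ '\N{name}',    "\N{LATIN SMALL LETTER I WITH ACUTE}" ], # by name
    [ "pack('U')",   pack('U', $code) ],                      # by code point
);

for my $case (@cases) {
    my ($label, $str) = @$case;
    printf "%-10s character string: %s\n",
        $label, is_char_string($str) ? 'yes' : 'no';
}

# Encoding the character explicitly gives the bytes that go out on the wire.
print hex_bytes(encode('UTF-8', pack('U', $code))), "\n";
```

The last line makes it easy to compare against the "c3 ad" pair in the od dump above. Whether print does an equivalent conversion on its own depends on the I/O layers on the output handle, which this sketch does not try to cover.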