Re: How to keep UTF-8 characters, but escape non-UTF-8 byte sequence to hex codes in ASCII



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Nov 29, 2006 at 05:25:25PM -0800, Daniel Yek wrote:
> Hi,
>
> I am attempting to handle raw filenames (which may be encoded differently
> than the character set used by the filesystem) gracefully.

[...]

> with a raw character outside of UTF-8 character set):
>
> Character:  P  r  e  s  e  n  t  a  c  i  ó  n  ó     .  s  x  i
> Hex code:   50 72 65 73 65 6e 74 61 63 69 f3 6e c3 b3 2e 73 78 69
>
> To be converted to this:
> Character:  P  r  e  s  e  n  t  a  c  i  %  f  3  n  ó     .  s  x  i
> Hex code:   50 72 65 73 65 6e 74 61 63 69 25 66 33 6e c3 b3 2e 73 78 69

And how is the converter supposed to guess that this "raw character"
(here 0xf3 and perhaps lots of following bytes) has to be interpreted as
an iso-8859-1 (or iso-8859-2) encoded thing (what you seem to imply
here)? This could just as well be a "у" or a "σ" (to cite some unibyte
encodings; going multibyte might be even more fun).
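
To make the ambiguity concrete, here is a small sketch in Python (not
libc; the list of encodings is my own pick for illustration) that
decodes the single byte 0xf3 under a few single-byte encodings. Nothing
in the byte itself says which reading is the intended one.

```python
# The same raw byte, read under several single-byte encodings.
# The encodings listed here are illustrative assumptions, not an
# exhaustive list of what a filename might be encoded in.
raw = bytes([0xF3])

interpretations = {
    enc: raw.decode(enc)
    for enc in ("iso-8859-1", "iso-8859-2", "iso-8859-7", "cp1251")
}

for enc, char in interpretations.items():
    print(f"{enc:12s} 0x{raw.hex()} -> {char!r} (U+{ord(char):04X})")
```

The two Latin encodings happen to agree on this byte; the Greek and
Cyrillic ones give entirely different letters.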

That means you'll have to handle those decisions yourself. Maybe the
libc routines iconv_open()/iconv()/iconv_close() can help you with that:
iconv() converts up to an illegal sequence, stops there and tells you
(it returns (size_t)-1 with errno set to EILSEQ and leaves the input
pointer at the offending byte).
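
If the goal is only to escape the bytes that are not valid UTF-8 without
guessing what they meant, the same "convert until it breaks, escape,
continue" loop can be sketched as follows. This is Python rather than
iconv, with the strict UTF-8 decoder's error playing the role of EILSEQ;
the function name escape_invalid_utf8 is made up for this example.

```python
def escape_invalid_utf8(raw: bytes) -> bytes:
    """Keep valid UTF-8 sequences, replace every other byte with %xx.

    Hypothetical helper: decode strictly, and whenever the decoder
    reports an invalid sequence, copy the valid prefix, escape the
    offending bytes as lowercase hex, and resume right after them.
    """
    out = bytearray()
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")      # strict: raises on bad input
            out += raw[pos:]               # the rest is valid, keep it
            break
        except UnicodeDecodeError as err:
            # err.start/err.end are offsets into the slice we decoded
            start, end = pos + err.start, pos + err.end
            out += raw[pos:start]          # valid bytes before the error
            out += b"".join(b"%%%02x" % b for b in raw[start:end])
            pos = end                      # continue after the bad bytes
    return bytes(out)


name = b"Presentaci\xf3n\xc3\xb3.sxi"
print(escape_invalid_utf8(name))
```

Two caveats: the loop re-decodes the tail after every bad sequence, which
is fine for filenames but not for large inputs, and a literal "%" in the
input is left alone, so the result cannot be reversed unambiguously
unless "%" is escaped as well.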

HTH
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFFbsaLBcgs9XrR2kYRAiFzAJwNeWk05WeekRO/xpy5SVizz0bRaACfZDYD
iL+hcmhMt4McadFzU4R3oSI=
=bSJY
-----END PGP SIGNATURE-----



