How to keep UTF-8 characters but escape non-UTF-8 byte sequences as ASCII hex codes



Hi,

I am attempting to handle raw filenames (which may be encoded differently than the character set used by the filesystem) gracefully.

I am looking for a function similar to g_filename_display_name(), but instead of converting each illegal byte sequence to the Unicode replacement character (U+FFFD, 0xef 0xbf 0xbd in UTF-8), I would like illegal byte sequences to be percent-escaped in ASCII, as in a URI.

To be clear, I want valid UTF-8 characters to remain UTF-8, and only the byte sequences that are not valid UTF-8 to be escaped. Is there a function that does that?

That is, I would like this string (a mostly UTF-8 filename containing one raw byte, 0xf3, that is not valid UTF-8):

Character:  P  r  e  s  e  n  t  a  c  i  ó  n  ó     .  s  x  i
Hex code:   50 72 65 73 65 6e 74 61 63 69 f3 6e c3 b3 2e 73 78 69

To be converted to this:
Character:  P  r  e  s  e  n  t  a  c  i  %  f  3  n  ó     .  s  x  i
Hex code:   50 72 65 73 65 6e 74 61 63 69 25 66 33 6e c3 b3 2e 73 78 69
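
To make the desired behaviour concrete, here is a minimal sketch in plain C of the conversion I have in mind. It does not use GLib; the names utf8_seq_len() and escape_invalid_utf8() are hypothetical helpers made up for illustration, not existing API. It assumes the filename is NUL-terminated and leaves a literal '%' in the input untouched, as in the example above. (With GLib, the validity check could presumably be done with g_utf8_validate(), which reports where the valid data ends through its `end` argument.)

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the length (1-4) of the valid UTF-8
 * sequence starting at s, or 0 if the bytes there are not valid UTF-8.
 * n is the number of bytes available and must be at least 1. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char c = s[0];
    unsigned long cp;
    size_t len, i;

    if (c < 0x80)
        return 1;                                   /* plain ASCII */
    else if (c >= 0xc2 && c <= 0xdf) { len = 2; cp = c & 0x1f; }
    else if (c >= 0xe0 && c <= 0xef) { len = 3; cp = c & 0x0f; }
    else if (c >= 0xf0 && c <= 0xf4) { len = 4; cp = c & 0x07; }
    else
        return 0;   /* stray continuation byte, 0xc0/0xc1, or above 0xf4 */

    if (len > n)
        return 0;                                   /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;                               /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3f);
    }

    if (len == 3 && cp < 0x800)       return 0;     /* overlong encoding */
    if (len == 4 && cp < 0x10000)     return 0;     /* overlong encoding */
    if (cp >= 0xd800 && cp <= 0xdfff) return 0;     /* UTF-16 surrogate */
    if (cp > 0x10ffff)                return 0;     /* beyond Unicode range */
    return len;
}

/* Hypothetical helper: copy valid UTF-8 sequences unchanged and write
 * every other byte as "%xx" (lowercase hex). Returns a newly allocated
 * string that the caller releases with free(), or NULL on allocation
 * failure. */
char *escape_invalid_utf8(const char *in)
{
    static const char hex[] = "0123456789abcdef";
    const unsigned char *s = (const unsigned char *)in;
    size_t n = strlen(in), i = 0, o = 0;
    char *out = malloc(3 * n + 1);   /* worst case: every byte becomes %xx */

    if (out == NULL)
        return NULL;

    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len > 0) {
            memcpy(out + o, s + i, len);            /* keep valid UTF-8 as-is */
            o += len;
            i += len;
        } else {
            out[o++] = '%';                         /* escape one invalid byte */
            out[o++] = hex[s[i] >> 4];
            out[o++] = hex[s[i] & 0x0f];
            i++;
        }
    }
    out[o] = '\0';
    return out;
}
```

Called on the filename above, escape_invalid_utf8("Presentaci\xf3n\xc3\xb3.sxi") yields the second byte sequence shown. Because '%' is not escaped, the result is for display only and cannot be unambiguously converted back; also escaping '%' as "%25" would be one way to make it reversible.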


I tried g_convert_with_fallback(str, -1, "UTF-8", "UTF-8" /* whatever codeset used by filesystem */, NULL, NULL, NULL, NULL), but this function does not accept input that is not valid UTF-8.

Thanks very much for any hints.


--
Daniel Yek



