Re: How to keep UTF-8 characters, but escape non-UTF-8 byte sequence to hex codes in ASCII



On Thu, Nov 30, 2006 at 11:54:51AM +0000, tomas tuxteam de wrote:
On Wed, Nov 29, 2006 at 05:25:25PM -0800, Daniel Yek wrote:
I am attempting to handle raw filenames (which may be encoded differently 
than the character set used by the filesystem) gracefully.

[...]
with a raw character outside of UTF-8 character set):

Character:  P  r  e  s  e  n  t  a  c  i  ó  n  ó     .  s  x  i
Hex code:   50 72 65 73 65 6e 74 61 63 69 f3 6e c3 b3 2e 73 78 69

To be converted to this:
Character:  P  r  e  s  e  n  t  a  c  i  %  f  3  n  ó     .  s  x  i
Hex code:   50 72 65 73 65 6e 74 61 63 69 25 66 33 6e c3 b3 2e 73 78 69

And how is the converter supposed to guess that this "raw character"
(here 0xf3 and perhaps lots of following bytes) has to be interpreted as
an iso-8859-1 (or iso-8859-2) encoded thing (which you seem to imply
here)? This could just as well be an "??" or an "??" (to cite some unibyte
encodings...).

I suppose the goal is to preserve information about the
bytes in a situation where their interpretation (i.e. which
characters they represent) is already lost, and in that case
your question is moot.  Whether this can actually be
helpful I will not judge.

OP: I doubt there is an existing function that does this, but
UTF-8 validation is very simple, so you can easily write the
function yourself.
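
For illustration only, here is a minimal sketch of such a function.
It is in Python rather than C, and instead of hand-written validation
it leans on the standard decoder's "surrogateescape" error handler,
which marks every byte that is not part of a valid UTF-8 sequence.
The name escape_invalid_utf8 is made up for this example, and the
lowercase %xx form simply follows the sample in the original question.

```python
def escape_invalid_utf8(raw: bytes) -> str:
    """Keep valid UTF-8 as characters; turn every other byte into %xx."""
    # surrogateescape maps each undecodable byte 0xNN to the lone
    # surrogate U+DC00 + 0xNN, which lands in U+DC80..U+DCFF.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Recover the original byte value and write it as hex.
            out.append("%{:02x}".format(cp - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)


# The file name from the question: a lone 0xf3 followed by a valid "ó".
name = bytes.fromhex("50 72 65 73 65 6e 74 61 63 69 f3 6e c3 b3 2e 73 78 69")
print(escape_invalid_utf8(name).encode("utf-8"))
# prints b'Presentaci%f3n\xc3\xb3.sxi'
```

Note that a literal "%" already present in the name is passed through
unchanged, so the result cannot be reversed unambiguously unless "%"
itself is escaped as well.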

Yeti


--
Whatever.


