Re: How to keep UTF-8 characters, but escape non-UTF-8 byte sequence to hex codes in ASCII



Hi,

First, thanks for replying. I appreciate it.

At 03:54 AM 11/30/2006, tomas tuxteam de wrote:
On Wed, Nov 29, 2006 at 05:25:25PM -0800, Daniel Yek wrote:
> Hi,
>
> I am attempting to handle raw filenames (which may be encoded differently
> than the character set used by the filesystem) gracefully.
>
[...]
> with a raw character outside of UTF-8 character set):
>
> Character:  P  r  e  s  e  n  t  a  c  i  ó  n  ó     .  s  x  i
> Hex code:   50 72 65 73 65 6e 74 61 63 69 f3 6e c3 b3 2e 73 78 69
>
> To be converted to this:
> Character:  P  r  e  s  e  n  t  a  c  i  %  f  3  n  ó     .  s  x  i
> Hex code:   50 72 65 73 65 6e 74 61 63 69 25 66 33 6e c3 b3 2e 73 78 69

And how is the converter supposed to guess that this "raw character"
(here 0xf3 and perhaps lots of following bytes) has to be interpreted as
an iso-8859-1 (or iso-8859-2) encoded thing (what you seem to imply
here)?

No, I don't think I implied that. I said that I want to handle "raw" filenames gracefully, whatever their encoding is; I can't tell what it is, and I don't need to.

I understand your point: if the original character set is not specified, there is no way to detect it from the bytes alone, because too many encodings are ambiguous. So, just call it "raw".

A lot of the time, it is adequate to interpret the byte sequence on a best-effort basis. g_filename_display_name() does that (so this answers your "question"), except that I don't like how the illegal character (now U+FFFD) is rendered: it looks seriously "broken" and is somewhat annoying. It would be better to show illegal bytes in a form that is easier to understand, such as an octal or hex escape sequence, or even a question mark.

Well, with g_utf8_validate(), it is trivial to implement a function that escapes non-UTF-8 bytes to hex. However, I then found out that TreeView, or more likely Pango, would unescape the %xx sequence (undoing my attempt to help it) and choke!
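
For reference, here is a minimal sketch of that kind of escaping function. To keep it self-contained it does not call g_utf8_validate(); the small validator utf8_seq_len() is a hypothetical stand-in for it, and escape_invalid_utf8() is likewise just an illustrative name, not a GLib function. It assumes a NUL-terminated input and returns a malloc()ed string.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Length of the valid UTF-8 sequence starting at s (at most `left` bytes
 * available), or 0 if the bytes there are not valid UTF-8.
 * Hypothetical stand-in for a g_utf8_validate()-based check. */
static size_t utf8_seq_len(const unsigned char *s, size_t left)
{
    size_t n, i;
    unsigned long cp;

    if (s[0] < 0x80)
        return 1;                                   /* plain ASCII */
    else if ((s[0] & 0xe0) == 0xc0) { n = 2; cp = s[0] & 0x1f; }
    else if ((s[0] & 0xf0) == 0xe0) { n = 3; cp = s[0] & 0x0f; }
    else if ((s[0] & 0xf8) == 0xf0) { n = 4; cp = s[0] & 0x07; }
    else
        return 0;               /* stray continuation byte or 0xf8..0xff */

    if (n > left)
        return 0;                                   /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;                               /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3f);
    }
    /* Reject overlong forms, surrogates and values above U+10FFFF. */
    if ((n == 2 && cp < 0x80) || (n == 3 && cp < 0x800) ||
        (n == 4 && cp < 0x10000))
        return 0;
    if ((cp >= 0xd800 && cp <= 0xdfff) || cp > 0x10ffff)
        return 0;
    return n;
}

/* Copy valid UTF-8 sequences unchanged; replace every other byte with
 * "%xx" (lowercase hex). Caller frees the result; NULL on allocation
 * failure. */
char *escape_invalid_utf8(const char *raw)
{
    const unsigned char *s = (const unsigned char *)raw;
    size_t len = strlen(raw), i = 0, o = 0;
    char *out = malloc(len * 3 + 1);    /* worst case: every byte escaped */

    if (out == NULL)
        return NULL;
    while (i < len) {
        size_t n = utf8_seq_len(s + i, len - i);
        if (n == 0) {
            sprintf(out + o, "%%%02x", (unsigned)s[i]);
            o += 3;
            i += 1;
        } else {
            memcpy(out + o, s + i, n);
            o += n;
            i += n;
        }
    }
    out[o] = '\0';
    return out;
}
```

Applied to the example bytes from the original post (50 72 ... f3 6e c3 b3 ...), this turns the lone f3 into 25 66 33 and leaves the valid c3 b3 alone. Note that a literal '%' in the input is not escaped here, so the result cannot be reversed unambiguously.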

I'm now quite sure that it is not worth the effort to handle a case like this, even though I think it should be doable and pain-free. (It would be, if not for the many things in GLib that get in the way.)

More random thoughts:
Is there a way to ask Pango to render illegal UTF-8 bytes as the more pleasant rectangle with a hex number in it (as it does when the font is not installed), rather than printing cryptic messages on the terminal?

Thanks much.


--
Daniel Yek




This could be as well an "Ñ?" or an "Ï?" (to cite some unibyte
encodings. Going multibyte might be even more fun).

That means you'll have to handle those decisions yourself. Maybe the
libc routines iconv_open()/iconv()/iconv_close() help you with that
(they try to convert up to an illegal sequence, stop there and tell you).
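
A rough sketch of that iconv() behaviour, for illustration only: the helper name first_bad_offset() is made up, and converting from UTF-8 to UTF-32LE is just one assumed way to force iconv() to decode the input. The converted output is thrown away; only the position where iconv() stops matters.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Offset of the first byte that iconv() refuses to decode as UTF-8,
 * or len if the whole buffer was accepted.
 * Returns (size_t)-1 if the conversion descriptor cannot be opened. */
size_t first_bad_offset(const char *buf, size_t len)
{
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");   /* to, from */
    char scratch[256];
    char *in = (char *)buf;     /* iconv() takes char **, input is not modified */
    size_t inleft = len;

    if (cd == (iconv_t)-1)
        return (size_t)-1;

    while (inleft > 0) {
        char *out = scratch;                /* reuse the scratch buffer */
        size_t outleft = sizeof scratch;

        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1
            && errno != E2BIG)
            break;      /* EILSEQ (illegal) or EINVAL (truncated): stop here */
    }
    iconv_close(cd);
    return (size_t)(in - buf);
}
```

The caller still has to decide what to do with the byte at the returned offset (escape it, skip it, guess an encoding) and then resume after it.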

HTH
- -- tomás



