RE: Just a few UTF8 questions...

From: Matthias Clasen <maclas gmx de>
To: martyn 2 russell bt com
Cc: gtk-app-devel-list gnome org
Subject: RE: Just a few UTF8 questions...
Date: Wed, 9 Jul 2003 11:32:58 +0200 (MEST)

Also, if I read in from a socket to a gchar buffer[1024] and I then 
proceed to print that information in the form 
  
  g_message("socket input: %*s", bytes, buffer);

Does the * represent how many characters or how many bytes are printed
from the buffer?


There was a thread about this in gtk-list in March:

http://mail.gnome.org/archives/gtk-list/2003-March/msg00007.html

The answers were:

a) The way GLib uses UTF-8 together with printf has the unfortunate effect
   that the precision operates on bytes rather than characters.

b) Glibc has a "feature" where %Ns actually checks for a whole 
   number of characters in the current encoding. So, unless you
   are sure you are always going to be in a UTF-8 locale, avoid
   using %Ns. (You are basically OK for iso-8859-1, but will
   have problems in, say, a Japanese locale.)
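
As a side note on the format string itself: in standard C, "%*s" takes its
argument as a minimum field width (padding only), while the precision that
points a) and b) talk about is written "%.*s". The following is a minimal
sketch of the difference, using plain snprintf in the default "C" locale
rather than g_message; the helper names format_with_precision and
format_with_width are made up for illustration.

```c
#include <stdio.h>

/* Precision form: copies at most `bytes` BYTES of `buf`, so it may stop
 * in the middle of a multibyte UTF-8 character. Returns snprintf's result. */
int format_with_precision(char *out, size_t out_size, const char *buf, int bytes)
{
    return snprintf(out, out_size, "%.*s", bytes, buf);
}

/* Width form: `width` is only a minimum field width. Short strings are
 * padded on the left; long strings are still printed up to their NUL,
 * so this form does not protect against an unterminated buffer. */
int format_with_width(char *out, size_t out_size, const char *buf, int width)
{
    return snprintf(out, out_size, "%*s", width, buf);
}
```

With the UTF-8 string "\xC3\xA9t\xC3\xA9" (three characters, five bytes), a
precision of 3 yields the first two characters, and a precision of 1 yields
only the first byte of the first character.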


If I receive information from a GLib IO channel, it should be UTF-8,
right?


If what Owen says is true, as I understand it, printf uses * for the number
of bytes and GLIB's implementation uses it for the number of characters.


No. Owen is talking about glibc, and the precision is always the number of
bytes (unless you use wprintf and wide characters). The feature Owen means is
that glibc checks that the bytes to be printed form a valid sequence of
characters in the encoding of the selected locale (i.e. that the byte array
doesn't end in the middle of a multibyte character).
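
To illustrate what "ends in the middle of a multibyte character" means for
UTF-8, here is a rough sketch of a check that trims a byte count back to the
last complete character. The function name utf8_complete_prefix_len is
hypothetical, and the code assumes the buffer is otherwise well-formed UTF-8;
it is not what glibc does internally. For full validation, GLib's
g_utf8_validate() reports where the valid data ends.

```c
#include <stddef.h>

/* Return the largest length <= len such that buf[0..length) does not end
 * inside a UTF-8 multibyte sequence. Only the tail is inspected; the rest
 * of the buffer is assumed to be valid UTF-8. */
size_t utf8_complete_prefix_len(const char *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    size_t i = len;
    size_t back = 0;

    /* Step back over up to three trailing continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (p[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return len;             /* no lead byte found; nothing to decide */

    unsigned char lead = p[i - 1];
    size_t need;                /* total bytes the last sequence requires */

    if (lead < 0x80)
        need = 1;               /* plain ASCII */
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return len;             /* not a lead byte; leave it to a validator */

    /* back + 1 bytes of the last sequence are present. */
    if (back + 1 < need)
        return i - 1;           /* incomplete: cut just before the lead byte */
    return len;
}
```

The result could then be passed as the precision to "%.*s", so that the
printed bytes always end on a character boundary.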


So if I receive a buffer filled with Russian characters, then my
buffer[1024] is FULL of multibyte characters.  Using GLIB's implementation
means that I would be attempting to print 1024 characters when in fact
there may only be 900.  This would be why it is causing a crash, but never
when the information is in English.  Do you agree?


IO channels do in fact return UTF-8. For the rest, see above.

So I can presume that printing WITHOUT the * would be the fix?


The simplest solution would certainly be to NUL-terminate the byte array and
omit the precision.
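
A minimal sketch of that approach, assuming the data arrives through a plain
POSIX read() on a file descriptor (the helper name read_terminated is made up
for this example): reserve one byte of the buffer for the terminator, write
the NUL after the bytes actually received, and then print with a plain "%s".

```c
#include <stdio.h>
#include <unistd.h>

/* Read up to capacity - 1 bytes from fd into buffer and NUL-terminate it.
 * Returns the number of bytes read, or -1 on error or zero capacity. */
ssize_t read_terminated(int fd, char *buffer, size_t capacity)
{
    if (capacity == 0)
        return -1;

    /* Leave room for the terminating NUL. */
    ssize_t n = read(fd, buffer, capacity - 1);
    if (n < 0)
        return -1;

    buffer[n] = '\0';
    return n;
}
```

After a successful call, the buffer can be passed without any width or
precision, e.g. g_message("socket input: %s", buffer). Note that a single
read may still end in the middle of a multibyte character, and any NUL byte
embedded in the input will cut the printed text short.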

Matthias



References:
- RE: Just a few UTF8 questions...
  - From: martyn . 2 . russell
