RE: Just a few UTF8 questions...
- From: Matthias Clasen <maclas gmx de>
- To: martyn 2 russell bt com
- Cc: gtk-app-devel-list gnome org
- Subject: RE: Just a few UTF8 questions...
- Date: Wed, 9 Jul 2003 11:32:58 +0200 (MEST)
> > > Also, if I read in from a socket to a gchar buffer[1024] and I then
> > > proceed to print that information in the form
> > >
> > > g_message("socket input: %*s", bytes, buffer);
> > >
> > > Does the * represent how many characters or how many bytes are printed
> > > from the buffer?
> >
> > There was a thread about this in gtk-list in March:
> >
> > http://mail.gnome.org/archives/gtk-list/2003-March/msg00007.html
> >
> > The answers were:
> >
> > a) The way GLib uses UTF-8 together with printf has the unfortunate effect
> > that the precision operates on bytes rather than characters.
> >
> > b) Glibc has a "feature" where %Ns actually checks for a whole
> > number of characters in the current encoding. So, unless you
> > are sure you are always going to be in a UTF-8 locale, avoid
> > using %Ns. (You are basically OK for iso-8859-1, but will
> > have problems in, say, a Japanese locale.)
>
> If I receive information from a GLib IO channel, it should be UTF-8,
> right?
>
> If what Owen says is true, as I understand it, printf uses * for the number
> of bytes and GLib's implementation uses it for the number of characters.
No. Owen is talking about glibc, and the precision is always a number of bytes
(unless you use wprintf and wide characters). The feature Owen means is that
glibc checks that the bytes to be printed form a valid sequence of characters
in the encoding of the selected locale (i.e., that the byte array doesn't end
in the middle of a multibyte character).
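
To illustrate the byte-counting behaviour, here is a minimal sketch in plain C. The helper name format_prefix is made up for this example and is not part of GLib or glibc. Note also that in standard C, "%*s" sets a minimum field width while "%.*s" sets the precision, so only the latter limits how many bytes are read from the buffer. The sketch assumes a C library that does not reject incomplete multibyte sequences for %s.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a GLib or glibc function): format at most
 * `nbytes` bytes of `src` into `dst` using a "%.*s" precision. */
static int format_prefix(char *dst, size_t dstsize, const char *src, int nbytes)
{
    assert(dst != NULL && src != NULL && dstsize > 0);
    /* The precision counts bytes, not characters, so a small value can
     * cut a multibyte UTF-8 character in half. */
    return snprintf(dst, dstsize, "%.*s", nbytes, src);
}
```

Calling format_prefix with the UTF-8 string "h\xc3\xa9llo" (five characters, six bytes) and a precision of 2 would be expected to copy 'h' plus only the first byte of the two-byte character; a precision of 3 keeps the character whole.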
>
> So if I receive a buffer filled with Russian characters, then my
> buffer[1024] is FULL of multibyte characters. Using GLib's implementation
> means that I would be attempting to print 1024 characters when in fact there
> may only be 900. This would be why it is causing a crash, but never when
> the information is in English. Do you agree?
IO channels do in fact return UTF-8. For the rest, see above.
> So I can presume that printing WITHOUT the * would be the fix?
The simplest solution would certainly be to NUL-terminate the byte array and
omit the precision.
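
A rough sketch of that approach in plain C follows. The helpers terminate_input and format_input are hypothetical names introduced only for this illustration, and the sketch assumes the caller read at most bufsize - 1 bytes so that there is room for the terminator.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: turn the first `nread` bytes of `buffer` into a
 * C string. Assumes at most bufsize - 1 bytes were read. */
static const char *terminate_input(char *buffer, size_t bufsize, size_t nread)
{
    assert(buffer != NULL && nread < bufsize);
    buffer[nread] = '\0';
    return buffer;
}

/* Hypothetical helper: build the log line with a plain %s, so no width
 * or precision is involved and printing stops at the terminator. */
static int format_input(char *dst, size_t dstsize,
                        char *buffer, size_t bufsize, size_t nread)
{
    return snprintf(dst, dstsize, "socket input: %s",
                    terminate_input(buffer, bufsize, nread));
}
```

With the original code, the same idea would amount to reading at most sizeof(buffer) - 1 bytes from the socket, setting buffer[bytes] = '\0', and then calling g_message("socket input: %s", buffer).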
Matthias