RE: [xml] UTF8Toisolat1() usage

From: "Christopher R. Maden" <crism maden org>
To: xml gnome org
Subject: RE: [xml] UTF8Toisolat1() usage
Date: Wed, 05 Jun 2002 01:08:54 -0700

At 01:06 5/6/02, Morus Walter wrote:

A conversion from UTF8 to Latin1 may only shorten the text (down to
half of the UTF8 byte length in extreme cases).
So allocating a buffer of the size of the UTF8 text will be sufficient.


No - the Latin 1 string will be between 0.5 and 1 times as many bytes as the
UTF-8 string.  For U+0000-U+007F, the UTF-8 and Latin 1 characters are
both one byte; for U+0080-U+00FF, the UTF-8 character is two bytes to
the Latin 1 character's one byte.  (Anything above U+00FF has no Latin 1
equivalent and cannot be converted at all.)  It would be wise to allocate
a buffer just as long as the UTF-8 string, since any language that uses
Latin 1 tends to use primarily the characters in the ASCII range.
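
To make the byte counts concrete, here is a minimal sketch of such a conversion.  It is not libxml2's UTF8Toisolat1(); the function name utf8_to_latin1 and its return convention are made up for illustration.  It only assumes that the caller passes an output buffer at least as large as the UTF-8 input, which is always enough.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of libxml2): convert `inlen` bytes of UTF-8
 * to Latin-1.  Returns the number of bytes written, or -1 if the input is
 * malformed, contains a code point above U+00FF, or `out` is too small.
 * An output capacity equal to `inlen` is always sufficient. */
long utf8_to_latin1(unsigned char *out, size_t outcap,
                    const unsigned char *in, size_t inlen)
{
    size_t i = 0, o = 0;

    assert(out != NULL && in != NULL);

    while (i < inlen) {
        unsigned char c = in[i];
        unsigned char ch;

        if (c < 0x80) {
            /* U+0000-U+007F: one byte in both encodings. */
            ch = c;
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) &&
                   i + 1 < inlen && (in[i + 1] & 0xC0) == 0x80) {
            /* U+0080-U+00FF: two UTF-8 bytes become one Latin-1 byte. */
            ch = (unsigned char)(((c & 0x03) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {
            /* Malformed sequence, or a character with no Latin-1 form. */
            return -1;
        }

        if (o >= outcap)
            return -1;
        out[o++] = ch;
    }
    return (long)o;
}
```

With "café" as five UTF-8 bytes (0x63 0x61 0x66 0xC3 0xA9), this writes four Latin-1 bytes, the last being 0xE9.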

If you convert Latin1 to UTF8 the text might need up to twice the space.


That's true, though only if the string contains only accented characters 
and less-common punctuation (which is difficult for a meaningful string of 
any size in any European language).
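
If the factor-of-two worst case is too wasteful, the exact size can be computed first.  The sketch below is only an illustration, not libxml2's isolat1ToUTF8(); both function names are hypothetical.  Every Latin 1 byte at or above 0x80 costs one extra byte in UTF-8, so the result is the input length plus the number of such bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: exact number of UTF-8 bytes needed to hold `inlen`
 * bytes of Latin-1.  Always between inlen and 2 * inlen. */
size_t latin1_utf8_size(const unsigned char *in, size_t inlen)
{
    size_t n = inlen, i;

    assert(in != NULL);

    for (i = 0; i < inlen; i++)
        if (in[i] >= 0x80)
            n++;            /* this byte expands to two UTF-8 bytes */
    return n;
}

/* Hypothetical helper: convert Latin-1 to UTF-8.  `out` must have room for
 * latin1_utf8_size(in, inlen) bytes.  Returns the number of bytes written. */
size_t latin1_to_utf8(unsigned char *out,
                      const unsigned char *in, size_t inlen)
{
    size_t i, o = 0;

    assert(out != NULL && in != NULL);

    for (i = 0; i < inlen; i++) {
        unsigned char c = in[i];

        if (c < 0x80) {
            out[o++] = c;                                   /* ASCII range */
        } else {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));    /* 0xC2 or 0xC3 */
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));  /* continuation */
        }
    }
    return o;
}
```

For "café" in Latin 1 (four bytes) the computed size is five; only a string made entirely of bytes at or above 0x80 reaches the full doubling.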

~Chris
-- 
Christopher R. Maden, Principal Consultant, crism consulting
DTDs/schemas - conversion - ebooks - publishing - Web - B2B - training
<URL: http://crism.maden.org/consulting/ >
PGP Fingerprint: BBA6 4085 DED0 E176 D6D4  5DFC AC52 F825 AFEC 58DA

References:
- RE: [xml] UTF8Toisolat1() usage
  - From: Henke, Markus
- RE: [xml] UTF8Toisolat1() usage
  - From: Morus Walter
