
RE: [xml] UTF8Toisolat1() usage



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

At 01:06 5/6/02, Morus Walter wrote:
>A conversion from UTF8 to Latin1 may only shorten the text (down to
>half of the UTF8 byte length in extreme cases).
>So allocating a buffer of the size of the UTF8 text will be sufficient.

No - the Latin 1 string may be between 0.5 and 1 times as many bytes as the 
UTF-8 string.  For U+0000-U+007F, the UTF-8 and Latin 1 characters will 
both be one byte; for U+0080-U+00FF, the UTF-8 character will be two bytes 
to the Latin 1 character's one byte.  It would be wise to allocate a buffer 
just as long as the UTF-8 string, since any language that uses Latin 1 tends 
to use primarily the characters in the ASCII range.
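
To make the byte counts above concrete, here is a minimal sketch in C that 
counts how many Latin 1 bytes a UTF-8 buffer would need.  The name 
latin1_length() is hypothetical and is not part of libxml2; the sketch also 
assumes that any character above U+00FF, or any malformed sequence, should 
simply be reported as an error.

```c
#include <stddef.h>

/* Hypothetical helper (not a libxml2 function): count the Latin 1 bytes
 * needed to hold the UTF-8 input `in` of `inlen` bytes.
 * Returns (size_t)-1 if the input is malformed or contains a character
 * above U+00FF, which Latin 1 cannot represent. */
size_t latin1_length(const unsigned char *in, size_t inlen)
{
    size_t i = 0;    /* read position in the UTF-8 input */
    size_t out = 0;  /* Latin 1 bytes counted so far */

    while (i < inlen) {
        unsigned char c = in[i];

        if (c < 0x80) {
            /* U+0000-U+007F: one byte in both encodings */
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < inlen &&
                   (in[i + 1] & 0xC0) == 0x80) {
            /* U+0080-U+00FF: two UTF-8 bytes become one Latin 1 byte */
            i += 2;
        } else {
            return (size_t)-1;
        }
        out += 1;
    }
    return out;
}
```

The result never exceeds inlen, so a buffer of inlen bytes (plus one if you 
want a terminating NUL) is always large enough for the converted text.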

>If you convert Latin1 to UTF8 the text might need up to twice the space.

That's true, though the full doubling happens only if every character in 
the string is in the U+0080-U+00FF range, i.e. accented characters and 
less-common punctuation (which is difficult for a meaningful string of any 
size in any European language).
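
For the reverse direction, the exact UTF-8 size can be computed up front 
instead of always allocating twice the input.  This is only a sketch; the 
name utf8_length() is hypothetical and is not part of libxml2.

```c
#include <stddef.h>

/* Hypothetical helper (not a libxml2 function): count the UTF-8 bytes
 * needed to hold the Latin 1 input `in` of `inlen` bytes. */
size_t utf8_length(const unsigned char *in, size_t inlen)
{
    size_t out = inlen;  /* every character needs at least one byte */
    size_t i;

    for (i = 0; i < inlen; i++) {
        if (in[i] >= 0x80) {
            /* U+0080-U+00FF takes a second byte in UTF-8 */
            out += 1;
        }
    }
    return out;
}
```

The result lies between inlen and 2 * inlen, and reaches the upper bound 
only when every input byte is 0x80 or above.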

~Chris
- -- 
Christopher R. Maden, Principal Consultant, crism consulting
DTDs/schemas - conversion - ebooks - publishing - Web - B2B - training
<URL: http://crism.maden.org/consulting/ >
PGP Fingerprint: BBA6 4085 DED0 E176 D6D4  5DFC AC52 F825 AFEC 58DA
-----BEGIN PGP SIGNATURE-----
Version: PGP Personal Privacy 6.5.8

iQA/AwUBPP3HFqxS+CWv7FjaEQJhOwCfeFJWPk2HEJGRuGLCkgdeRNCwA2sAn2wV
3auodVRhGSo807dfFn7SkSOc
=qK7l
-----END PGP SIGNATURE-----



