UTF-8 Functions

From: Owen Taylor <otaylor redhat com>
To: gtk-devel-list gnome org
Cc: gtk-i18n-list gnome org, trow ximian com, darin bentspoon com
Subject: UTF-8 Functions
Date: 01 Jul 2001 21:17:02 -0400

I've now added the following functions to GLib. I'm pretty happy with
them as encapsulating the basic operations of this nature in a simple
manner.

The main change I'm considering at this point is to add a max_len (can
be -1) parameter to normalize(), casefold(), strup(), strdown(), and
collate_key(). It makes things just a little bit more complicated, but
I hate having to g_strndup() a portion of a string, do something, and
then free the dup'ed string immediately. 

(If you look inside the implementations, you'll see that this is
a convenience concern, not an efficiency concern at this point!) 

But if people have other easy-to-implement improvements, I'd be
happy to consider them as well.

Regards,
                                        Owen

/**
 * g_utf8_normalize:
 * @str: a UTF-8 encoded string.
 * @mode: the type of normalization to perform.
 * 
 * Convert a string into canonical form, standardizing
 * such issues as whether a character with an accent
 * is represented as a base character and combining
 * accent or as a single precomposed character. You
 * should generally call g_utf8_normalize() before
 * comparing two Unicode strings.
 *
 * The normalization mode %G_NORMALIZE_DEFAULT only
 * standardizes differences that do not affect the
 * text content, such as the above-mentioned accent
 * representation. %G_NORMALIZE_ALL also standardizes
 * the "compatibility" characters in Unicode, such
 * as SUPERSCRIPT THREE to the standard forms
 * (in this case DIGIT THREE). Formatting information
 * may be lost but for most text operations such
 * characters should be considered the same.
 * For example, g_utf8_collate() normalizes
 * with %G_NORMALIZE_ALL as its first step.
 *
 * %G_NORMALIZE_DEFAULT_COMPOSE and %G_NORMALIZE_ALL_COMPOSE
 * are like %G_NORMALIZE_DEFAULT and %G_NORMALIZE_ALL,
 * but return a result with composed forms rather
 * than a maximally decomposed form. This is often
 * useful if you intend to convert the string to
 * a legacy encoding or pass it to a system with
 * less capable Unicode handling.
 * 
 * Return value: the string in normalized form
 **/
gchar *g_utf8_normalize (const gchar    *str,
		         GNormalizeMode  mode);

/**
 * g_utf8_strdown:
 * @str: a UTF-8 encoded string
 * 
 * Converts all Unicode characters in the string that have a case
 * to lowercase. The exact manner in which this is done depends
 * on the current locale, and may result in the number of
 * characters in the string changing.
 * 
 * Return value: a newly allocated string, with all characters
 *    converted to lowercase.  
 **/
gchar *g_utf8_strdown (const gchar *str);

/**
 * g_utf8_strup:
 * @str: a UTF-8 encoded string
 * 
 * Converts all Unicode characters in the string that have a case
 * to uppercase. The exact manner in which this is done depends
 * on the current locale, and may result in the number of
 * characters in the string increasing. (For instance, the
 * German ess-zet will be changed to SS.)
 * 
 * Return value: a newly allocated string, with all characters
 *    converted to uppercase.  
 **/
gchar *g_utf8_strup (const gchar *str);

/**
 * g_utf8_casefold:
 * @str: a UTF-8 encoded string
 * 
 * Converts a string into a form that is independent of case. The
 * result will not correspond to any particular case, but can be
 * compared for equality or ordered with the results of calling
 * g_utf8_casefold() on other strings.
 * 
 * Note that calling g_utf8_casefold() followed by g_utf8_collate() is
 * only an approximation to the correct linguistic case insensitive
 * ordering, though it is a fairly good one. Getting this exactly
 * right would require a more sophisticated collation function that
 * takes case sensitivity into account. GLib does not currently
 * provide such a function.
 * 
 * Return value: a newly allocated string, that is a
 *   case independent form of @str.
 **/
gchar *g_utf8_casefold (const gchar *str);

/**
 * g_utf8_collate:
 * @str1: a UTF-8 encoded string
 * @str2: a UTF-8 encoded string
 * 
 * Compares two strings for ordering using the linguistically
 * correct rules for the current locale. When sorting a large
 * number of strings, it will be significantly faster to
 * obtain collation keys with g_utf8_collate_key() and 
 * compare the keys with strcmp() when sorting instead of
 * sorting the original strings.
 * 
 * Return value: -1 if str1 compares before str2, 0 if they
 *   compare equal, 1 if str1 compares after str2.
 **/
gint g_utf8_collate (const gchar *str1, const gchar *str2);

/**
 * g_utf8_collate_key:
 * @str: a UTF-8 encoded string.
 * 
 * Converts a string into a collation key that can be compared
 * with other collation keys using strcmp(). The results of
 * comparing the collation keys of two strings with strcmp()
 * will always be the same as comparing the two original
 * strings with g_utf8_collate().
 * 
 * Return value: a newly allocated string. This string should
 *   be freed with g_free() when you are done with it.
 **/
gchar *g_utf8_collate_key (const gchar *str);