Re: GLib: wide-character gregex?



> Is there a regex package in GLib that is capable of searching/matching wide
> characters?

No. GLib's string APIs (except for the explicit wide char conversion
ones) handle just plain char strings, generally assumed to be UTF-8 in
cases where it matters. But if you know that a file is in wide
characters (i.e. UTF-16LE on Windows), then you can use
g_utf16_to_utf8() to convert its contents to UTF-8 once you have read
it in (or mapped it into memory).

> for future reference, I would like to try and track down a wchar_t
> implementation of regex functions. I was hoping GLib already had them, but
> perhaps I am wrong.

Wide characters (wchar_t), although per se part of standard C, in
practice are used mostly in Windows-specific programming. On Unix and
Linux, especially in free software circles, encoding Unicode as UTF-8
is the rule, and thus normal string functions and coding conventions
can be used. (One notable exception is OpenOffice.org, which used
UTF-16 internally also on Unix. Dunno about Mozilla, for instance.) So
in software being mainly developed by people using Linux, you seldom
see wchar_t.

(Note that the wchar_t type in gcc on Linux is 32 bits, not 16 bits
like on Windows, so it actually can represent all characters in
current Unicode. On Windows when you use wchar_t strings you still
have to take into consideration that some characters will actually
take a pair of wchar_ts, so in practice the kind of code you end up
writing doesn't differ significantly from code that handles UTF-8 or
other variable-length encodings anyway. It is a question of handling
Unicode characters as 1..4 chars or 1..2 wchar_ts. You can't just
pretend each wchar_t is a freestanding character, and that wchar_t
strings can be split at any place with each part being valid.
Surrogate pairs do exist.)
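
To make the surrogate pair point concrete, here is a minimal sketch in plain C of reading one character from 16-bit units. The function name utf16_decode() is invented for this example, and it uses uint16_t rather than wchar_t so that it behaves the same on Linux, where wchar_t is 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one Unicode code point from the UTF-16
 * units s[0..len).  Stores it in *cp and returns the number of units
 * consumed (1 or 2).  Returns 0 for empty input or an unpaired
 * surrogate, which is what you get if a string is split mid-pair. */
static size_t
utf16_decode (const uint16_t *s, size_t len, uint32_t *cp)
{
  assert (s != NULL && cp != NULL);

  if (len == 0)
    return 0;

  uint16_t first = s[0];

  /* Anything outside D800..DFFF is a complete character by itself. */
  if (first < 0xD800 || first > 0xDFFF)
    {
      *cp = first;
      return 1;
    }

  /* A high surrogate (D800..DBFF) must be followed by a low one
   * (DC00..DFFF); together they carry 20 bits above U+10000. */
  if (first <= 0xDBFF && len >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF)
    {
      *cp = 0x10000
            + (((uint32_t) (first - 0xD800) << 10)
               | (uint32_t) (s[1] - 0xDC00));
      return 2;
    }

  return 0;
}
```

For example, U+1D11E is stored as the two units 0xD834 0xDD1E. Calling the function on both units yields the code point and consumes 2 units, while calling it on either half alone returns 0.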

--tml

