Re: Glib Unicode regex (was: Gtk::Text widget)



Michael Livshin <mlivshin@bigfoot.com> writes:

> Havoc Pennington <hp@redhat.com> writes:
> 
> > >     Derek> * Are there any other Unicode-supporting regex libs we can look at?
> > > 
> > > Although I haven't checked the copyright, IBM's ICU library has one.
> > > The latest version of Perl has the best and most complete
> > > implementation I've seen yet, but it would be tough to untangle it
> > > from the surrounding code.
> > 
> > The Perl one looks viciously difficult to extract from Perl. I haven't
> > looked at the ICU engine, I'll ask Owen about it, I know he's looked
> > at ICU in general.
> 
> ISTR that the latest version of Henry Spencer's regexp library
> supports UTF8 natively.

The most recent version of Henry Spencer's library I'm familiar with -
the one in Tcl - supports Unicode via wide characters. Since Henry
Spencer did the work himself on adding this support, I don't really
expect that he did it again using UTF-8. But it is possible.

Tcl uses UTF-8 everywhere else, then converts to wide characters
for regular expression compilation and matching, which seems
quite painful from a performance point of view.
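
To illustrate the cost in question, here is a minimal sketch of the
kind of conversion pass a wide-character engine needs before it can
compile or match anything. This is not Tcl's actual code; the
function name utf8_to_wide and its error convention are made up for
the example, and it assumes NUL-terminated input and 32-bit code
points. The point is only that every subject string gets decoded and
copied in full before the regex engine sees it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from Tcl or any other library.
 * Decodes the NUL-terminated UTF-8 string s into out[], one 32-bit
 * code point per element. Returns the number of code points written,
 * or (size_t)-1 if the input is malformed or out[] is too small.
 * Simplification: overlong forms and surrogates are not rejected. */
size_t utf8_to_wide(const char *s, uint32_t *out, size_t out_cap)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    assert(s != NULL && out != NULL);

    while (*p) {
        uint32_t cp;
        int extra;  /* number of continuation bytes still expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;       /* stray continuation or invalid lead byte */
        p++;

        while (extra-- > 0) {
            /* A NUL here also fails this test, so we never read past the end. */
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
            p++;
        }

        if (n == out_cap)
            return (size_t)-1;        /* caller's buffer is too small */
        out[n++] = cp;
    }
    return n;
}
```

A caller would run this over the whole subject string, and over the
pattern, on every match, which is the extra pass and extra buffer
that a UTF-8-native engine would avoid.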

I certainly agree with Havoc that the Perl version is not feasibly
extractable from Perl. Also, there are license issues. It is,
however, the only UTF-8 based regular expression code I'm aware
of.

I'm not aware of any regular expression functionality in ICU.
If there were any, it would have the problem that the
IBM PL is not GPL compatible.

There are a number of other Unicode-capable implementations of
Perl-style regular expressions available - there is the one in
Netscape's JavaScript implementation (dual GPL, NPL, wide character),
and there is the sre code in Python 1.6 (wide character, still in alpha).

If we consider the requirements to be:

 - Support Perl-style regular expressions
 - Use UTF-8 natively
 - LGPL compatible

Then we don't seem to have much choice other than to create something
ourselves. I don't think it is a huge job to convert something
like PCRE to support UTF-8; maybe about a week to do a basic
job. But I doubt that is going to happen before GLib-2.0.
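
As a rough illustration of what "use UTF-8 natively" asks of a
byte-oriented matcher, the sketch below shows one small piece: making
"." consume a whole UTF-8 sequence instead of a single byte, working
directly on the byte string with no conversion. This is not PCRE
code; utf8_seq_len and match_any_char are hypothetical names, and a
real conversion would also have to handle character classes, case
folding, and backtracking over multi-byte characters.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that
 * starts with lead byte c, or 0 if c cannot start a sequence. */
static size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80)           return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Hypothetical helper: match "." (any one character) at s.
 * Returns a pointer just past the matched character, or NULL at the
 * end of the string or on malformed input. */
const char *match_any_char(const char *s)
{
    size_t len = utf8_seq_len((unsigned char)*s);
    size_t i;

    if (*s == '\0' || len == 0)
        return NULL;

    /* Each following byte must be a continuation byte (10xxxxxx).
     * A NUL fails this test, so we stop before reading past the end. */
    for (i = 1; i < len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            return NULL;
    }
    return s + len;
}
```

The matcher keeps a plain char pointer throughout; only the step
size changes from "one byte" to "one sequence".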

Regards,
                                        Owen



