Re: Performance implications of GRegex structure

From: Owen Taylor <otaylor redhat com>
To: Yevgen Muntyan <muntyan tamu edu>
Cc: GTK+ development mailing list <gtk-devel-list gnome org>
Subject: Re: Performance implications of GRegex structure
Date: Fri, 16 Mar 2007 08:28:26 -0400

On Thu, 2007-03-15 at 14:16 -0500, Yevgen Muntyan wrote:
> [Owen, I apologize, I hit Reply instead of Reply All]
> 
> Owen Taylor wrote:
> > So, the regular expression code has been committed to CVS finally. Yay!
> >
> > But looking over the header file, there is something that puzzles me
> > about the way that it's set up: there is no distinction between a
> > "pattern/regular expression" object and a match/matcher object.
> ...
> > Or to Javascript, Perl, etc. (Javascript and Perl hide the issue a bit
> > by having regular expression literals.) While I have never actually done
> > timings on the matter, I've always assumed that the reason that regular
> > expression API's are set up this way is compiling a regular expression
> > has a significant expense.
> 
> EggRegex contains two structures internally - Pattern and Match.
> egg_regex_copy() references Pattern and creates new Match structure.
> If you got a copy of EggRegex object using egg_regex_copy(), you
> can use it in another thread. Don't know how convenient it is in
> multi-thread setup. The idea was that it's inconvenient to create
> some match objects on heap.

If people will forgive a code-verbose mail, I want to show some examples
of how things look in practice.

So, start off with a simple function written with the current API:

  char *
  get_leading_digits(const char *str)
  {
     GRegex *regex = g_regex_new("^\\d+", 0, 0, NULL);
     char *result = NULL;
 
     if (g_regex_match(str, 0))
         result = g_regex_fetch(regex, 0);

     g_regex_free(regex);

     return result;
  }

Short, simple, thread-safe, but inherently slow. We could write
the same thing with an API with an explicit match object:

  char *
  get_leading_digits(const char *str) 
  {
     GRegex *regex = g_regex_new("^\\d+", 0, 0, NULL);
     GMatcher *matcher = g_matcher_new(regex, str, 0);
     char *result = NULL;

     if (g_matcher_matches(matcher))
         result = g_matcher_get(regex, 0);

     g_matcher_unref(matcher);
     g_regex_free(regex);

     return result;
  }

It gets two lines longer and slightly more cumbersome, but it's not
tremendously worse. But really, repeatedly compiling a regular
expression is a bad idea - something we don't want to encourage.So, the
simplest optimization of the original code looks something like:

  char *
  get_leading_digits(const char *str)
  {
     static GRegex *regex = NULL;
     char *result = NULL;

     if (!regex)
         regex = g_regex_new("^\\d+", 0, 0, NULL);

     if (g_regex_match(str, 0))
         result = g_regex_fetch(regex, 0);

     return result;
  }

That code bothers me a fair bit; not because so much because it's
not thread safe, but because it exhibits a pattern that is *inherently*
not thread safe or re-entrant. People should have built-in reflexes
against writing such code. I'm much happier with the alternative:

  char *
  get_leading_digits(const char *str)
  {
     static GRegex *regex = NULL;
     GMatcher *matcher;
     char *result = NULL;

     if (!regex)
         regex = g_regex_new("^\\d+", 0, NULL);
   
     matcher = g_matcher_new(regex, str, 0);
     if (g_matcher_matches(matcher))
         result = g_matcher_get(regex, 0);

     g_matcher_free(matcher);

     return result;
  }

Even though it's longer and no more thread safe. It's not thread
safe, but can be made so without changing the basic pattern:

  gpointer
  get_leading_digits_regex(gpointer data) 
  {
       return g_regex_new("^\\d+", 0, NULL);
  }

  char *
  get_leading_digits(const char *str)
  {
     static GOnce once = G_ONCE_INIT;
     GRegex *regex = g_once(&once, get_leading_digits_regex, NULL);
     GMatcher *matcher;
     char *result = NULL;

     matcher = g_matcher_new(regex, str, 0);
     if (g_matcher_matches(matcher))
         result = g_matcher_get(regex, 0);

     g_matcher_free(matcher);

     return result;
  }

As Yevgen points out, we can do the same thing with the current GRegex
API, but I don't think the result is very readable. It isn't code that
"says what it does".

  gpointer
  get_leading_digits_regex(gpointer data) 
  {
       return g_regex_new("^\\d+", 0, NULL);
  }

  char *
  get_leading_digits(const char *str)
  {
     static GOnce once = G_ONCE_INIT;
     GRegex *template = g_once(&once, get_leading_digits_regex, NULL);
     GRegex *regex;
     char *result = NULL;

     regex = g_regex_copy(template);
     if (g_regex_match(str, 0))
         result = g_regex_fetch(regex, 0);

     g_regex_free(regex);
  
     return result;
  }

Now, a regular expression that you want compile once and keep in a
static variable is actually the most common use of a regular
expression. So, maybe the right version of the function actually
looks like:

  char *
  get_leading_digits(const char *str)
  { 
     GStaticRegex regex = G_STATIC_REGEX_INIT("^\\d+", 0);
     GMatcher *matcher;
     char *result = NULL;

     matcher = g_matcher_new_static(&regex, str, 0);
     if (g_matcher_matches(matcher))
         result = g_matcher_get(regex, 0);

     g_matcher_free(matcher);

     return result;
  }

I'm not going to argue that the current GRegex API is unworkable,
but I think it obscures the nature of the system - first you compile a
regular expression, then you match against it - and that's going to make
it harder for people to write correct, efficient code.

					- Owen

Follow-Ups:
- Re: Performance implications of GRegex structure
  - From: Morten Welinder
- Re: Performance implications of GRegex structure
  - From: Dimi Paun
- RE: Performance implications of GRegex structure
  - From: David Moffatt

References:
- Performance implications of GRegex structure
  - From: Owen Taylor
- Re: Performance implications of GRegex structure
  - From: Yevgen Muntyan

[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]