Performance implications of GRegex structure



So, the regular expression code has been committed to CVS finally. Yay!

But looking over the header file, there is something that puzzles me
about the way that it's set up: there is no distinction between a
"pattern/regular expression" object and a match/matcher object.


GRegex        *g_regex_new(const gchar         *pattern,
                           GRegexCompileFlags   compile_options,
                           GRegexMatchFlags     match_options,
                           GError             **error);
gboolean    g_regex_match (GRegex              *regex,
                           const gchar         *string,
                           GRegexMatchFlags     match_options);
gboolean g_regex_fetch_pos(const GRegex        *regex,
                           gint                 match_num,
                           gint                *start_pos,
                           gint                *end_pos);

Compare that to Java:

 Pattern pattern = Pattern.compile("(.*?)-(.*)");
 Matcher m = pattern.matcher(str);
 if (m.matches()) {
      before_dash = m.group(1);
 }

Or to Python:

 pattern = re.compile("(.*?)-(.*)")
 m = pattern.match(str)
 if m:
    before_dash = m.group(1)

Or to PCRE:

 pcre *compiled = pcre_compile("(.*?)-(.*)", 0, &err, &err_offset, NULL);
 [...]
 if (pcre_exec(compiled, NULL,
               str, strlen(str), 0, 0,
               ovector, G_N_ELEMENTS(ovector)) >= 0) {
     before_dash = g_strndup(str + ovector[2], ovector[3] - ovector[2]);
 }

(There is no match[er] object here, but the equivalent is in all the in
and out parameters ...)

Or to JavaScript, Perl, etc. (JavaScript and Perl hide the issue a bit
by having regular expression literals.) While I have never actually done
timings on the matter, I've always assumed that regular expression APIs
are set up this way because compiling a regular expression has a
significant expense.
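
To make the pattern/match split concrete, here is a minimal sketch using
POSIX <regex.h> rather than GRegex or PCRE. The helper name before_dash
is hypothetical, and since POSIX extended syntax has no lazy quantifier
the pattern is assumed to be "^([^-]*)-(.*)$", which captures the same
first group for this case. The compiled regex_t is only read; all
per-match state lives in the caller's own regmatch_t array.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of the text matched
 * by the first capture group, or NULL if the string does not match or
 * allocation fails. The compiled pattern is passed as const and the match
 * offsets are kept in a local array, so nothing is stored in the pattern. */
static char *
before_dash (const regex_t *compiled, const char *str)
{
    regmatch_t groups[3];   /* whole match plus two capture groups */
    size_t len;
    char *result;

    if (regexec (compiled, str, 3, groups, 0) != 0)
        return NULL;

    len = (size_t) (groups[1].rm_eo - groups[1].rm_so);
    result = malloc (len + 1);
    if (result == NULL)
        return NULL;

    memcpy (result, str + groups[1].rm_so, len);
    result[len] = '\0';
    return result;
}
```

A caller would compile once with regcomp (&re, "^([^-]*)-(.*)$",
REG_EXTENDED), call before_dash (&re, str) as often as needed, free each
result, and call regfree (&re) at the end.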

With the GRegex structure I seem to have two choices:

 - Compile the regular expression once, and use it in a non-thread-safe,
   non-reentrant way. (shades of strtok)

 - Compile a new regular expression every time I want to do a match.

Neither is very appealing to me as a coder, though I could be convinced
that the second is OK by suitable performance timings. Do we have such
numbers?
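
One rough way to get such numbers is sketched below. It uses POSIX
<regex.h> as a stand-in, so it says nothing definite about GRegex or
PCRE. The function names and the pattern are assumptions made for
illustration only. It times N matches with a single compile against N
matches that each recompile the pattern.

```c
#include <regex.h>
#include <time.h>

#define PATTERN "^([^-]*)-(.*)$"

/* Compile once, then match n times. Returns the number of successful
 * matches, or -1 if the pattern fails to compile. */
static long
match_compile_once (const char *str, long n)
{
    regex_t re;
    regmatch_t groups[3];
    long hits = 0, i;

    if (regcomp (&re, PATTERN, REG_EXTENDED) != 0)
        return -1;
    for (i = 0; i < n; i++)
        if (regexec (&re, str, 3, groups, 0) == 0)
            hits++;
    regfree (&re);
    return hits;
}

/* Compile, match, and free on every iteration. Same return convention. */
static long
match_compile_each_time (const char *str, long n)
{
    regmatch_t groups[3];
    long hits = 0, i;

    for (i = 0; i < n; i++) {
        regex_t re;

        if (regcomp (&re, PATTERN, REG_EXTENDED) != 0)
            return -1;
        if (regexec (&re, str, 3, groups, 0) == 0)
            hits++;
        regfree (&re);
    }
    return hits;
}

/* Runs fn, stores its result in *hits, and returns elapsed CPU seconds. */
static double
time_it (long (*fn) (const char *, long), const char *str, long n, long *hits)
{
    clock_t start = clock ();

    *hits = fn (str, n);
    return (double) (clock () - start) / CLOCKS_PER_SEC;
}
```

Calling time_it with each function on the same string and a large n,
then printing the two results, gives a first impression of the ratio.
A real answer would need the same experiment run against the actual
GRegex code with representative patterns.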

					- Owen




