Re: Performance implications of GRegex structure



On Sat, Mar 17, 2007 at 12:48:15AM -0500, Yevgen Muntyan wrote:
> I am suggesting something which is currently used in real code.
> Simple, nice, and working. *If* it's not as good as it should be,
> *then* it should be changed.

It's not as good as it should be. :-)

    1) In terms of interface, it does not appear to be designed to be
       thread-safe, and the documentation does not hint that it is
       thread-safe. Some sort of guarantee would have to be made, such
       as "the match function does not modify GRegexp state". PCRE makes
       this guarantee:

           The  compiled form of a regular expression is not altered during
           matching, so the same compiled pattern can safely be used by
           several threads at once.

    2) In terms of functionality, the concept of matching for anything
       beyond the various simplest of matching, *does* have state that
       should be kept and maintained. This state should not be kept in
       GRegexp, because of 1). It should be kept in a GMatcher object.
       This state includes:
            - the compiled regexp being used
            - the string being matched against
            - the offset into the string that the next match will begin
            - the captured parameters from the last match

This isn't to say that a GMatcher object should be *required*. There is
nothing wrong with having convenience functions for the simple cases.
This is why I suggested:

> > If you want simple - give up on GRegexp altogether. Try something like:
> >     if (g_string_regexp_match(s, pattern))
> >         ...
> > If it happens to do some sort of internal caching - great. If not?
> > At least it is simple. For Java, this maps to:
> >     if (string.matches(pattern))
> >         ...
> Should I say "if you want cool - give up on C altogether"? I guess
> not. I mean, it's not like I want simplicity at all costs. No need to
> provide fancy-shmancy java stuff. It's a real serious question: what
> would be the uses of GRegex, where it would be convenient
> to have separate Match structure and where it would not.

There should be simple wrapper functions for simple uses. If there was
a 'gboolean g_string_regexp_match(GString* string, gchar* pattern)' or a
'gboolean g_strmatch(gchar* string, gchar* pattern)', I would probably
use them. When using regular expressions, I find myself in two positions:

    1) I want to express a complicated match using very simple code.
       I am not concerned about performance. This would be best served
       by the above two functions, that would require no object-handling
       at all (at least for my program). If the patterns happened to be
       cached, great. If not, at least my implementation is simple, which
       represents: a) fewer product defects, and b) easier to maintain.

    2) I have a need for a heavy-duty performance-optimized match function,
       in which case my needs are beyond the very simple, and I require
       it to perform well, and scale. For this, having two objects works
       well in other languages. The compilation is performed as early as
       required. The matching can then be freely performed by any thread
       without worry. For example, if I have a multi-threaded server with
       a thread-pool to manage active connections. I may use a regular
       expression pattern to ensure that various parts of the protocol
       are syntactically correct, and to extract data from the records
       sent by the client. I do not wish to recompile the pattern every
       time a new client is accepted, and I do not match to have a mutex
       lock around any use of the GRegexp. I want any number of clients
       to be at any number of match states, on the same pre-compiled
       regular expression.

An important factor in all of this, is that regular expressions are
powerful because they allow complex expressions to be written safely.
Without a good regular expression library available to me in C, I've
found it necessary to write my own pattern matching code. This is very
error-prone. Even if I get it right - who is to say somebody who changes
my code will keep it right?

> > Having a matcher object serves more purposes than just thread
> > scaleability. What if I wish to walk through the string, finding
> > each match, processing each match as it is found?
> > Possible.
> > Why should I
> > have to search the entire search before I can display the first
> > match?
> You can't do the contrary - find all matches and display them.
> (I guess Marco should know better, I've never done stuff like
> this)
> > In Perl, this functionality is available as:
> >     while ($scalar =~ /(pattern)/g) {
> >         ... each match ...
> >     }
> > With a Matcher object, the same can be accomplished in a thread-safe
> > manner.
> Could you show how it would be done (i.e. show C code)? And
> what's "Matcher"? Is it something that performs matching, or
> is it a result of search (match)? I guess it could make sense to
> collect all results after searching repeatedly, but it doesn't seem
> to be what you are talking about.

GMatcher would keep the following state:
    - the compiled regexp being used
    - the string being matched against
    - the offset into the string that the next match will begin
    - the captured parameters from the last match

This is a real life example of something I might want to do, using
a syntax which might be desirable for me. I have not made any requirements
for thread-safety here. The below is about functionality only:

    GString* output = ...
    GRegexp* header_regexp = g_regexp_new("/^Hyperlinks:\\n/m");
    if (g_regexp_matches(header_regexp, output, 0, output->len)) {
        GRegexp* record_regexp =
            g_regexp_new("/^  (\\S+) (|\".*?\") -> (|\".*?\") (\\S+)$/m");
        GMatcher* record_matcher =
            g_regexp_matcher(record_regexp, output,
                             g_regexp_match_offset_end(0), output->len);
        while (g_matcher_matches(record_matcher)) {
            gchar* from_hlink_selector = g_matcher_group(1);
            gchar* from_hlink_text = g_matcher_group(2);
            gchar* to_hlink_selector = g_matcher_group(3);
            gchar* to_hlink_text = g_matcher_group(4);
            ...
        }
        g_matcher_free(record_matcher);
    }
    g_regexp_free(section_regexp);

The above would match again:

    Hyperlinks:
        hlink-selector "..." -> "..." hlink-selector
        ...

Notice all the state kept above. Include the g_regexp_match_offset_end(0),
which keeps state in GRegexp.

For regular expressions to be very useful, they need to be complete.

Cheers,
mark

-- 
mark mielke cc / markm ncf ca / markm nortel com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/




[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]