Re: Performance implications of GRegex structure
- From: mark mark mielke cc
- To: Yevgen Muntyan <muntyan tamu edu>
- Cc: gtk-devel-list gnome org
- Subject: Re: Performance implications of GRegex structure
- Date: Sat, 17 Mar 2007 10:51:52 -0400
On Sat, Mar 17, 2007 at 12:48:15AM -0500, Yevgen Muntyan wrote:
> I am suggesting something which is currently used in real code.
> Simple, nice, and working. *If* it's not as good as it should be,
> *then* it should be changed.
It's not as good as it should be. :-)
1) In terms of interface, it does not appear to be designed to be
thread-safe, and the documentation does not hint that it is
thread-safe. Some sort of guarantee would have to be made, such
as "the match function does not modify GRegexp state". PCRE makes
this guarantee:
The compiled form of a regular expression is not altered during
matching, so the same compiled pattern can safely be used by
several threads at once.
2) In terms of functionality, the concept of matching for anything
beyond the various simplest of matching, *does* have state that
should be kept and maintained. This state should not be kept in
GRegexp, because of 1). It should be kept in a GMatcher object.
This state includes:
- the compiled regexp being used
- the string being matched against
- the offset into the string that the next match will begin
- the captured parameters from the last match
This isn't to say that a GMatcher object should be *required*. There is
nothing wrong with having convenience functions for the simple cases.
This is why I suggested:
> > If you want simple - give up on GRegexp altogether. Try something like:
> > if (g_string_regexp_match(s, pattern))
> > ...
> > If it happens to do some sort of internal caching - great. If not?
> > At least it is simple. For Java, this maps to:
> > if (string.matches(pattern))
> > ...
> Should I say "if you want cool - give up on C altogether"? I guess
> not. I mean, it's not like I want simplicity at all costs. No need to
> provide fancy-shmancy java stuff. It's a real serious question: what
> would be the uses of GRegex, where it would be convenient
> to have separate Match structure and where it would not.
There should be simple wrapper functions for simple uses. If there was
a 'gboolean g_string_regexp_match(GString* string, gchar* pattern)' or a
'gboolean g_strmatch(gchar* string, gchar* pattern)', I would probably
use them. When using regular expressions, I find myself in two positions:
1) I want to express a complicated match using very simple code.
I am not concerned about performance. This would be best served
by the above two functions, that would require no object-handling
at all (at least for my program). If the patterns happened to be
cached, great. If not, at least my implementation is simple, which
represents: a) fewer product defects, and b) easier to maintain.
2) I have a need for a heavy-duty performance-optimized match function,
in which case my needs are beyond the very simple, and I require
it to perform well, and scale. For this, having two objects works
well in other languages. The compilation is performed as early as
required. The matching can then be freely performed by any thread
without worry. For example, if I have a multi-threaded server with
a thread-pool to manage active connections. I may use a regular
expression pattern to ensure that various parts of the protocol
are syntactically correct, and to extract data from the records
sent by the client. I do not wish to recompile the pattern every
time a new client is accepted, and I do not match to have a mutex
lock around any use of the GRegexp. I want any number of clients
to be at any number of match states, on the same pre-compiled
regular expression.
An important factor in all of this, is that regular expressions are
powerful because they allow complex expressions to be written safely.
Without a good regular expression library available to me in C, I've
found it necessary to write my own pattern matching code. This is very
error-prone. Even if I get it right - who is to say somebody who changes
my code will keep it right?
> > Having a matcher object serves more purposes than just thread
> > scaleability. What if I wish to walk through the string, finding
> > each match, processing each match as it is found?
> > Possible.
> > Why should I
> > have to search the entire search before I can display the first
> > match?
> You can't do the contrary - find all matches and display them.
> (I guess Marco should know better, I've never done stuff like
> this)
> > In Perl, this functionality is available as:
> > while ($scalar =~ /(pattern)/g) {
> > ... each match ...
> > }
> > With a Matcher object, the same can be accomplished in a thread-safe
> > manner.
> Could you show how it would be done (i.e. show C code)? And
> what's "Matcher"? Is it something that performs matching, or
> is it a result of search (match)? I guess it could make sense to
> collect all results after searching repeatedly, but it doesn't seem
> to be what you are talking about.
GMatcher would keep the following state:
- the compiled regexp being used
- the string being matched against
- the offset into the string that the next match will begin
- the captured parameters from the last match
This is a real life example of something I might want to do, using
a syntax which might be desirable for me. I have not made any requirements
for thread-safety here. The below is about functionality only:
GString* output = ...
GRegexp* header_regexp = g_regexp_new("/^Hyperlinks:\\n/m");
if (g_regexp_matches(header_regexp, output, 0, output->len)) {
GRegexp* record_regexp =
g_regexp_new("/^ (\\S+) (|\".*?\") -> (|\".*?\") (\\S+)$/m");
GMatcher* record_matcher =
g_regexp_matcher(record_regexp, output,
g_regexp_match_offset_end(0), output->len);
while (g_matcher_matches(record_matcher)) {
gchar* from_hlink_selector = g_matcher_group(1);
gchar* from_hlink_text = g_matcher_group(2);
gchar* to_hlink_selector = g_matcher_group(3);
gchar* to_hlink_text = g_matcher_group(4);
...
}
g_matcher_free(record_matcher);
}
g_regexp_free(section_regexp);
The above would match again:
Hyperlinks:
hlink-selector "..." -> "..." hlink-selector
...
Notice all the state kept above. Include the g_regexp_match_offset_end(0),
which keeps state in GRegexp.
For regular expressions to be very useful, they need to be complete.
Cheers,
mark
--
mark mielke cc / markm ncf ca / markm nortel com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...
http://mark.mielke.cc/
[
Date Prev][
Date Next] [
Thread Prev][
Thread Next]
[
Thread Index]
[
Date Index]
[
Author Index]