Re: Performance implications of GRegex structure



> When you evaluate an API, you have to look at a number of things:
> - Is the API complete? Can it do what is needed
> - Does the API allow getting common things done in a few lines of
>   code?
> - Is the API easy to figure out?
> - Is the resulting code legible and easy to read?
> - Does the API encourage writing efficient and correct code?

Personally, my main interests in this thread, based on tasks I've had to perform in the past, are these;


*) To be able to compile a pattern that allows a given search to be applied repeatedly, as efficiently as possible.

Once a pattern is compiled, there should be no reason for the compiled version to change.  It may even be possible to incorporate a pre-compiled regexp directly, if it makes any sense to do so.  As much work as possible should be done on the pattern so that matching against the compiled form should be as fast as possible.

A pattern intended to be used solely from one thread could, however, be partially compiled, and then finished off if and when it's needed.  There should, if this is the case, be a function to call to fully boil it down immediately if it's going to be shared.

[history] I have had a situation where I needed to search several MB of log files, for all occurrences of some 20 regular expressions.


*) to be able to search for a pattern repeatedly within a potentially infinite length data stream.

Even in a single-threadded application, a given pattern could be used concurrently on several different data streams.  So keeping the search state separate from the pattern would be a good idea.

Being able to obtain both the start and end of the last match will allow an application developer to decide whether to allow over-lapping matches or not.

[history] The closest I've come to this, is searching for one of several patterns within the first 4KB of a data stream.  These patterns were acceptance and rejection patterns of a TCP data stream.   Definite negatives on the acceptance patterns, and a definite positive on any rejection pattern, with as few accepted packets as possible, would have been exceedingly useful.


*) I personally don't care if it's a single object, or separate objects.  As long as it can be used where it's needed, without unnecessary duplication of the compiled regular expression, or unnecessary re-compilation.

Refcounting on the search pattern object may allow it to be shared without any unnecessary duplication, and would most certainly simplify the whole API.  However, it doesn't allow for multiple simultaneous searches without some pretty bizarre magic.  And a _copy() function isn't really that obvious, at least to me.  I'd be expecting it to copy everything, including the present search state, as opposed to just the search pattern.  Which isn't what you want (there may still be some use to a _copy() function that does copy the current search state).  At least with the separate object method, it's obvious that you're taking a pattern, and starting a search on it.  Reference counting on the pattern will allow it to be automatically destroyed when no longer needed, regardless of how many concurrent searches are running.


*) if any part of it is going to be called GMatch, then it should be done as a generic pattern searching API.  This could work well for allowing one routine to search for regexp's, fixed strings, file-system style glob expressions, and just about anything else someone can imagine.

Create a "pattern" object with g_match_regexp_new(), or g_match_fixed_new(), etc., and pass it to the GMatch functions to actually do the search.  The class of the pattern object would know how to interpret the object data correctly.  The "fixed" pattern compiler could, for example, build a Boyer-Moore style skip table.  The "glob" pattern compiler might simply function much as a simplified regexp.  More interestingly, this opens the way rather nicely to supporting different "levels" of regexp.

[history] This type of functionality has been quite useful in some scripting languages I have used on occasion.  Unfortunately, in those cases, I had to use some rather dirty tricks.  I imagine similar circumstances must occasionally occur in C also, and will require just this kind of structure.


*) wrappers should be written for the common cases.  g_match_regexp_simple() would search a buffer for a given regular expression.  It will essentially compile (g_match_regexp_new), match (g_match_???), destroy (g_object_unref), and return the basic boolean result.  Another variant can accept pointers into which to deposit the first several "results".  If allowing other search pattern types, similar functions would exist for them also.

GMatch would have a similar function which takes a compiled pattern, and runs it against a buffer, disregarding any result strings, and just returning the simple boolean result.  For any more complex tasks, the full weight of GMatch is available.

[history] There have been plenty of occasions when I've just wanted to know whether a string matches a pattern.


Fredderic

_______________________________________________
Join Excite! - http://www.excite.com
The most personalized portal on the Web!





[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]