Re: The URL regex.
- From: Albrecht Dreß <albrecht dress arcormail de>
- To: Brian Stafford <brian stafford uklinux net>
- Cc: Pawel Salek <pawsa TheoChem kth se>,Carlos Morgado <chbm chbm nu>, balsa-list gnome org,Brian Stafford <brian stafford uklinux net>,Gediminas Paulauskas <menesis delfi lt>
- Subject: Re: The URL regex.
- Date: Tue, 22 May 2001 18:52:36 +0200
On 22.05.2001 at 10:24:19, Brian Stafford wrote:
> | I *thought* that "\<" and "\>" match a word separator in regular
> | expressions,
>
> Correct.....
>
> | not the literal "<" and ">". `man 7 regex' says that it should be
> | "[[:<:]]" in
> | this case, though. But the first one seems to work anyway. Hmmm....
>
> .... the problem is this construct isn't portable. GNU regex uses
> \< and \> for the word boundaries, Henry Spencer's regex uses
> [[:<:]] and [[:>:]] and Emacs and PCRE use something else. I have
> no idea what other Unix RE packages allow as word boundaries or if
> they even have them. IIRC Posix doesn't define them.
Did I already say that I *love* standards? I remember now that I got them from
egrep, and for me they work as they should, but [[:<:]] doesn't.
> Probably most reliable to omit them.
Some people (like me;-)) write e.g. a comma or a ")" directly after a URL,
like...
: see http://www.balsa.net, or write a mail to Pawel
...and the "\>" worked perfectly to get rid of it. So this is not the complete
solution (but almost, see below).
> My only comment here is that I've replaced the character class of
> [^[:blank:]] with one which explicitly enumerates the permitted
> characters. Longer to write, but still compiles to one character class.
> In addition to the character class, the alternative RE explicitly
> checks for % followed by two hex digits.
Looks nice, for sure the better solution!
> If the not blank character class is retained, some extra characters
> will need adding to the class to improve the reliability of a
> valid match, e.g. the double quote '"', consider
> "http://some.host/some/path".
Agree.
> With the % escapes and the permitted characters explicitly listed
> the chances of matching the valid portion of the URL are surely
> improved. If the external program can't cope it's not our problem.
ditto.
> RFC 1738 has been updated by several newer RFCs. RFC 2396 now
> describes the generic syntax.
Thanks for that hint, I'll check that...
> Since Balsa doesn't interpret the URLs, it only needs to match a
> generic URI and identify the protocol to find a program to handle it.
Agree again.
> Make the URI scheme portion of the RE \([[:alpha:]][-+.[:alnum:]]*\)://
> and the trailing portion the one I gave before. The scheme substring
> is then available for matching with the correct external program.
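For illustration, a minimal sketch of pulling out the scheme substring with the expression quoted above. The expression is in POSIX basic RE syntax, so the sketch compiles it without REG_EXTENDED. extract_scheme() is a hypothetical helper name, not Balsa code, and only the scheme portion is matched because the trailing portion is not repeated in this message.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Scheme portion as quoted above, in basic RE syntax. */
static const char *scheme_re = "\\([[:alpha:]][-+.[:alnum:]]*\\)://";

/*
 * Hypothetical helper: copy the scheme of the first "scheme://" found in
 * text into out (truncated to outlen - 1 bytes).
 * Returns 0 on success, -1 if nothing matched or the RE failed to compile.
 */
static int extract_scheme(const char *text, char *out, size_t outlen)
{
    regex_t re;
    regmatch_t m[2];   /* m[0]: whole match, m[1]: the \( ... \) group */
    size_t len;
    int rc;

    assert(text != NULL && out != NULL && outlen > 0);
    if (regcomp(&re, scheme_re, 0) != 0)   /* flags 0 selects basic RE */
        return -1;
    rc = regexec(&re, text, 2, m, 0);
    regfree(&re);
    if (rc != 0 || m[1].rm_so < 0)
        return -1;

    len = (size_t)(m[1].rm_eo - m[1].rm_so);
    if (len >= outlen)
        len = outlen - 1;
    memcpy(out, text + m[1].rm_so, len);
    out[len] = '\0';
    return 0;
}
```

The string left in out could then be used as a lookup key for the external handler program. Schemes written without "//", such as mailto: and news:, are not matched by this expression and would need separate handling.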
So, here is the next attempt for a RE (at some point we could start a
who-writes-the-longest-RE-which-does-the-correct-thing-contest ;-)):
char *url_str = "((((https?|ftps?|gopher|telnet|nntp)://)|(mailto:|news:))(%[0-9A-Fa-f]{2}|[-()_.!~*';/?:@&=+$,A-Za-z0-9])+)([).!';/?:,][[:blank:]])?";
As you can see, it already has the extension to detect mailto, news, nntp,
telnet and gopher (anyone still using that?!?). Basically, it uses your
proposal, but I replaced [:alnum:] with explicit ranges, as some locales might
define national characters as belonging to this class (at least the man page
states this). At the end, a punctuation character followed by a blank may be
present. The trick is now to supply regexec() with *two* regmatch_t records.
The first one contains the whole match, but the second one (which we use)
covers only the first parenthesized group, so the last part
"([).!';/?:,][[:blank:]])?" is omitted. In the end, this fakes the missing
"\>" operator.
On 22.05.2001 at 12:33:33, Gediminas Paulauskas wrote:
> I have looked into GtkHTML, it uses four regexps to recognize even URLs
> with the protocol part omitted.
*That's* nice!
> static HTMLMagicInsertMatch mim [] = {
> { "(news|telnet|nttp|file|http|ftp|https)://([-a-z0-9]+(:[-a-z0-9]+)?@)?[-a-z0-9.]+[-a-z0-9](:[0-9]*)?(/[-a-z0-9_$.+!*(),;:@%&=?/~#]*[^]'.}>\\)
> ,?!;:\"]?)?", NULL, NULL },
> { "www[-a-z0-9.]+[-a-z0-9](:[0-9]*)?(/[-A-Za-z0-9_$.+!*(),;:@%&=?/~#]*[^]'.}>\\)
> ,?!;:\"]?)?", NULL, "http://" },
> { "ftp[-a-z0-9.]+[-a-z0-9](:[0-9]*)?(/[-A-Za-z0-9_$.+!*(),;:@%&=?/~#]*[^]'.}>\\)
> ,?!;:\"]?)?", NULL, "ftp://" },
> { "[-_a-z0-9.]+@[-_a-z0-9.]+", NULL, "mailto:" }
> };
>
> The first member of each structure is the URL pattern, the second is NULL,
> and the last one is the protocol prefix.
>
> So in Evolution I think all URLs are highlighted correctly.
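A rough sketch of how a pattern table with protocol prefixes like the one quoted above might be applied. The message does not show how GtkHTML itself uses the table, so the loop below is a guess. struct magic_match and magic_url() are hypothetical names, the two patterns are shortened versions of the quoted ones, and REG_ICASE is an assumption.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the quoted table: a pattern plus the prefix
 * to prepend when the protocol was omitted (NULL for none). */
struct magic_match {
    const char *pattern;
    const char *prefix;
};

static const struct magic_match table[] = {
    { "www[-a-z0-9.]+[-a-z0-9](:[0-9]*)?", "http://" },
    { "[-_a-z0-9.]+@[-_a-z0-9.]+",         "mailto:" },
};

/*
 * Try each pattern in table order and write prefix + matched text to out.
 * Returns 0 on success, -1 if nothing matched or out is too small.
 */
static int magic_url(const char *text, char *out, size_t outlen)
{
    size_t i;

    assert(text != NULL && out != NULL && outlen > 0);
    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        regex_t re;
        regmatch_t m;
        int rc, n;

        if (regcomp(&re, table[i].pattern, REG_EXTENDED | REG_ICASE) != 0)
            continue;
        rc = regexec(&re, text, 1, &m, 0);
        regfree(&re);
        if (rc != 0)
            continue;

        n = snprintf(out, outlen, "%s%.*s",
                     table[i].prefix ? table[i].prefix : "",
                     (int)(m.rm_eo - m.rm_so), text + m.rm_so);
        return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
    }
    return -1;
}
```

Compiling each pattern on every call keeps the sketch short. The NULL in the second member of the quoted table may well be a slot for caching the compiled expression, but that is also a guess.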
This also looks very interesting. I'll try it... As always, stay tuned.
Thanks to you all,
Albrecht.
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Albrecht Dreß - Monschauer Straße 22 - D-53121 Bonn (Germany)
Phone (+49) 228 6199571 - E-Mail albrecht.dress@arcormail.de
_________________________________________________________________________