Re: The URL regex.

From: Albrecht Dreß <albrecht dress arcormail de>
To: Brian Stafford <brian stafford uklinux net>
Cc: Pawel Salek <pawsa TheoChem kth se>,Carlos Morgado <chbm chbm nu>, balsa-list gnome org,Brian Stafford <brian stafford uklinux net>,Gediminas Paulauskas <menesis delfi lt>
Subject: Re: The URL regex.
Date: Tue, 22 May 2001 18:52:36 +0200

On 22.05.2001 at 10:24:19, Brian Stafford wrote:
> | I *thought* that "\<" and "\>" match a word separator in regular
> | expressions,
> 
> Correct.....
> 
> | not the literal "<" and ">". `man 7 regex' says that it should be
> | "[[:<:]]" in
> | this case, though. But the first one seems to work anyway. Hmmm....
> 
> .... the problem is this construct isn't portable.  GNU regex uses
>  \< and \> for the word boundaries, Henry Spencer's regex uses
> [[:<:]] and [[:>:]] and Emacs and PCRE use something else.  I have
> no idea what other Unix RE packages allow as word boundaries or if
> they even have them.  IIRC Posix doesn't define them.

Did I already say that I *love* standards? I remember now that I got them from
egrep, and for me "\<" and "\>" work as they should, but [[:<:]] doesn't.

> Probably most reliable to omit them.

Some people (like me;-)) write e.g. a comma or a ")" directly after a URL,
like...

: see http://www.balsa.net, or write a mail to Pawel

...and the "\>" worked perfectly to get rid of it. So this is not the complete
solution (but almost, see below).

> My only comment here is that I've replaced the character class of
> [^[:blank:]] with one which explicitly enumerates the permitted
> characters.  Longer to write, but still compiles to one character class.
> In addition to the character class, the alternative RE explicitly
> checks for % followed by two hex digits.

Looks nice, for sure the better solution!

> If the not blank character class is retained, some extra characters
> will need adding to the class to improve the reliability of a
> valid match, e.g. the double quote '"', consider
> "http://some.host/some/path".

Agree.

> With the % escapes and the permitted characters explicitly listed
> the chances of matching the valid portion of the URL are surely
> improved.  If the external program can't cope it's not our problem.

ditto.

> RFC 1738 has been updated by several newer RFCs.  RFC 2396 now
> describes the generic syntax.

Thanks for that hint, I'll check that...

> Since Balsa doesn't interpret the URLs, it only needs to match a
> generic URI and identify the protocol to find a program to handle it.

Agree again.

> Make the URI scheme portion of the RE \([[:alpha:]][-+.[:alnum:]]*\)://
> and the trailing portion the one I gave before.  The scheme substring
> is then available for matching with the correct external program.
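As a rough illustration of the suggestion quoted above, the sketch below compiles only the scheme portion of the RE in basic (non-extended) syntax and reads the scheme back from the first subexpression. The helper name uri_scheme() and its buffer handling are assumptions made for this example; the trailing portion of the RE is left out.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * uri_scheme() is a hypothetical helper, not part of Balsa.  It copies the
 * scheme of the first "scheme://" found in `text` into `out` (truncated to
 * outlen - 1 bytes) and returns 0, or returns -1 if there is no match.
 */
static int uri_scheme(const char *text, char *out, size_t outlen)
{
    regex_t re;
    regmatch_t m[2];    /* m[0]: "scheme://", m[1]: the scheme alone */
    int rc = -1;

    assert(text != NULL && out != NULL && outlen > 0);

    /* cflags 0 selects basic syntax, where \( and \) delimit a group */
    if (regcomp(&re, "\\([[:alpha:]][-+.[:alnum:]]*\\)://", 0) != 0)
        return -1;

    if (regexec(&re, text, 2, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);

        if (len >= outlen)
            len = outlen - 1;
        memcpy(out, text + m[1].rm_so, len);
        out[len] = '\0';
        rc = 0;
    }
    regfree(&re);
    return rc;
}
```

The extracted scheme could then be looked up in whatever table maps protocols to external programs; that lookup is not shown here.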

So, here is the next attempt for a RE (at some point we could start a
who-writes-the-longest-RE-which-does-the-correct-thing-contest ;-)):

char *url_str = "((((https?|ftps?|gopher|telnet|nntp)://)|(mailto:|news:))(%[0-9A-Fa-f]{2}|[-()_.!~*';/?:@&=+$,A-Za-z0-9])+)([).!';/?:,][[:blank:]])?";

As you can see, it already has the extension to detect mailto, news, nntp,
telnet and gopher (anyone still using that?!?). Basically, it uses your
proposal, but I replaced [:alnum:] as some locale might define national
characters belonging to this class (at least the man page states this). At the
end, a punctuation character followed by a blank may be present. The trick is
now to supply regexec() with *two* regmatch_t records. The first one covers the
whole match, but the second one (which we use) covers only the first
parenthesised group, so the trailing part "([).!';/?:,][[:blank:]])?" is
omitted. In the end, this fakes the missing "\>" operator.
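A minimal sketch of the two-record trick, assuming a POSIX <regex.h> with leftmost-longest matching. The pattern is the one given above, only split across string literals; find_url() and its buffer handling are hypothetical names introduced for this example and are not Balsa code.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* The pattern from the message, split over several literals for width. */
static const char *url_str =
    "((((https?|ftps?|gopher|telnet|nntp)://)|(mailto:|news:))"
    "(%[0-9A-Fa-f]{2}|[-()_.!~*';/?:@&=+$,A-Za-z0-9])+)"
    "([).!';/?:,][[:blank:]])?";

/*
 * find_url() is a hypothetical helper.  It copies the first URL found in
 * `text` into `out` (truncated to outlen - 1 bytes) and returns 0, or
 * returns -1 if nothing matched or the pattern failed to compile.
 */
static int find_url(const char *text, char *out, size_t outlen)
{
    regex_t re;
    regmatch_t m[2];    /* m[0]: whole match, m[1]: URL without trailer */
    int rc = -1;

    assert(text != NULL && out != NULL && outlen > 0);

    /* "{2}", "|", "+" and "?" need the extended syntax */
    if (regcomp(&re, url_str, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, text, 2, m, 0) == 0) {
        /* use the first group only, ignoring "punctuation + blank" */
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);

        if (len >= outlen)
            len = outlen - 1;
        memcpy(out, text + m[1].rm_so, len);
        out[len] = '\0';
        rc = 0;
    }
    regfree(&re);
    return rc;
}
```

Because the optional trailer requires a blank after the punctuation, a URL that ends the text with a final "." or ")" would still keep that character; handling that case is outside this sketch.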

On 22.05.2001 at 12:33:33, Gediminas Paulauskas wrote:
> I have looked into GtkHTML, it uses four regexps to recognize even URLs,
> with omitted protocol part.

*That's* nice!

> static HTMLMagicInsertMatch mim [] = {
> 	{ "(news|telnet|nttp|file|http|ftp|https)://([-a-z0-9]+(:[-a-z0-9]+)?@)?[-a-z0-9.]+[-a-z0-9](:[0-9]*)?(/[-a-z0-9_$.+!*(),;:@%&=?/~#]*[^]'.}>\\) ,?!;:\"]?)?", NULL, NULL },
> 	{ "www[-a-z0-9.]+[-a-z0-9](:[0-9]*)?(/[-A-Za-z0-9_$.+!*(),;:@%&=?/~#]*[^]'.}>\\) ,?!;:\"]?)?", NULL, "http://" },
> 	{ "ftp[-a-z0-9.]+[-a-z0-9](:[0-9]*)?(/[-A-Za-z0-9_$.+!*(),;:@%&=?/~#]*[^]'.}>\\) ,?!;:\"]?)?", NULL, "ftp://" },
> 	{ "[-_a-z0-9.]+@[-_a-z0-9.]+", NULL, "mailto:" }
> };
> 
> The first member of the structure is the regexp for the URL, the second is
> NULL, and the last one is the protocol prefix.
> 
> So in Evolution I think all URLs are highlighted correctly.
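The table-with-prefix idea quoted above can be sketched roughly as follows. This is not the GtkHTML code: the struct, the function name magic_link() and the REG_ICASE flag are assumptions for the example. The mail pattern is the one quoted, while the "www" pattern is cut down to the host part of the quoted one.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical table entry: a regexp and the prefix to prepend. */
struct magic_entry {
    const char *regex;
    const char *prefix;
};

static const struct magic_entry magic[] = {
    { "www[-a-z0-9.]+[-a-z0-9](:[0-9]*)?", "http://" },  /* host part only */
    { "[-_a-z0-9.]+@[-_a-z0-9.]+",         "mailto:" }
};

/*
 * magic_link(): tries each entry in order and writes prefix + matched text
 * into `out`.  Returns 0 on success, -1 if no entry matched or the result
 * did not fit into outlen bytes.
 */
static int magic_link(const char *text, char *out, size_t outlen)
{
    size_t i;

    assert(text != NULL && out != NULL && outlen > 0);

    for (i = 0; i < sizeof magic / sizeof magic[0]; i++) {
        regex_t re;
        regmatch_t m[1];
        int hit;

        /* REG_ICASE is an assumption: the patterns are lower-case only */
        if (regcomp(&re, magic[i].regex, REG_EXTENDED | REG_ICASE) != 0)
            continue;
        hit = (regexec(&re, text, 1, m, 0) == 0);
        regfree(&re);

        if (hit) {
            int len = (int)(m[0].rm_eo - m[0].rm_so);
            int n = snprintf(out, outlen, "%s%.*s", magic[i].prefix,
                             len, text + m[0].rm_so);

            return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
        }
    }
    return -1;
}
```

Compiling each pattern on every call keeps the sketch short; a real table would more likely compile each regexp once and cache it, which may be what the NULL second member is reserved for.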

This also looks very interesting. I'll try it... As always, stay tuned.

Thanks to you all,

	Albrecht.


-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Albrecht Dreß  -  Monschauer Straße 22  -  D-53121 Bonn (Germany)
      Phone (+49) 228 6199571  -  E-Mail albrecht.dress@arcormail.de
_________________________________________________________________________

