Re: The URL regex.

From: Brian Stafford <brian stafford uklinux net>
To: Albrecht Dreß <albrecht dress arcormail de>
Cc: Pawel Salek <pawsa TheoChem kth se>,Carlos Morgado <chbm chbm nu>, balsa-list gnome org,Brian Stafford <brian stafford uklinux net>
Subject: Re: The URL regex.
Date: Tue, 22 May 2001 09:24:19 +0100

On Mon, 21 May 18:22 Albrecht Dreß wrote:
| > I have noticed  the url regex expression 
| > const char *url_str = "((ht|f)tps?://[^[:blank:]\r\n]+)";
| > has been replaced by
| > const char *url_str = "\\<((ht|f)tp[s]?://[^[:blank:]]+)\\>";  
| 
| I think the second one was my original, not sure anymore. I don't
| remember who introduced the first version (it *might* actually be me,
| but it's of course not correct;-))...
| 
| > IMO, the <>-brackets around the URL are rare and I think including
| them
| 
| I *thought* that "\<" and "\>" match a word separator in regular
| expressions,

Correct.....

| not the literal "<" and ">". `man 7 regex' says that it should be
| "[[:<:]]" in this case, though. But the first one seems to work
| anyway. Hmmm....

.... the problem is that this construct isn't portable.  GNU regex uses
\< and \> for the word boundaries, Henry Spencer's regex uses
[[:<:]] and [[:>:]], and Emacs and PCRE use something else.  I have
no idea what other Unix RE packages allow as word boundaries or if
they even have them.  IIRC POSIX doesn't define them.

Probably most reliable to omit them.
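
A minimal sketch of matching without any word-boundary operator, using
only the POSIX regcomp(), regexec() and regfree() calls.  The helper
name find_url() and its offset-returning interface are made up for
illustration and are not Balsa code; the pattern is the first one
quoted above, minus the outer parentheses.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: locate the first http/https/ftp URL in `text`
 * using only POSIX ERE constructs (no \<, \> or [[:<:]]).
 * Returns 0 and stores byte offsets in *start and *end on a match,
 * -1 if the pattern fails to compile or nothing matches. */
static int find_url(const char *text, size_t *start, size_t *end)
{
    static const char *url_str = "(ht|f)tps?://[^[:blank:]\r\n]+";
    regex_t re;
    regmatch_t m[1];
    int rc;

    if (regcomp(&re, url_str, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return -1;

    *start = (size_t) m[0].rm_so;   /* offset of the first matched byte */
    *end = (size_t) m[0].rm_eo;     /* offset just past the match */
    return 0;
}
```

Since the pattern starts with a fixed scheme and stops at the first
blank, it does not depend on how a given regex library spells its
word-boundary operators.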

| > Also the [^[:blank:]]+ pattern doesn't exclude
| > characters like "()<>" etc that would normally be % quoted in a
| > URL.  So it's likely that the pattern picks up some trailing garbage.
| > 
| > Taking the legal characters from RFC 2396, I suggest the following
| > for the trailing portion of the pattern
| > 
| > (%[[:digit:]A-Fa-f][[:digit:]A-Fa-f]|[-_.!~*'();/?:@&=+$,[:alnum:]])+
| > 
| > This pattern includes the parentheses characters but these could be
| > problematic so it might be best to omit them.
| > 
| > The complete RE, omitting the () characters, would be
| > 
| > (http|ftp)s?://(%[[:digit:]A-Fa-f][[:digit:]A-Fa-f]|[-_.!~*';/?:@&=+$,[:alnum:]])+
| 
| IMHO, it is sufficient to check for a string without blanks, separated
| by word boundaries (space, beginning/end of line, ")", ".", ..., *if*
| we find the correct coding for that, see above;-)).

My only comment here is that I've replaced the character class of
[^[:blank:]] with one which explicitly enumerates the permitted
characters.  Longer to write, but still compiles to one character class.
In addition to the character class, the alternative RE explicitly
checks for % followed by two hex digits.
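
As a rough comparison of the two approaches, the sketch below measures
how much of a line each pattern matches.  match_len() and the two
constant names are hypothetical, introduced only for this example; the
explicit pattern is the one quoted above without the () characters,
written with the closing ]] that regcomp() needs.

```c
#include <regex.h>

/* "Not blank" pattern versus the pattern that enumerates the
 * permitted characters and checks % escapes explicitly. */
static const char *NOT_BLANK_RE = "(ht|f)tps?://[^[:blank:]]+";
static const char *EXPLICIT_RE =
    "(http|ftp)s?://"
    "(%[[:digit:]A-Fa-f][[:digit:]A-Fa-f]|[-_.!~*';/?:@&=+$,[:alnum:]])+";

/* Hypothetical helper: length in bytes of the first match of the
 * POSIX extended regex `pattern` in `text`, or -1 if the pattern
 * fails to compile or does not match. */
static long match_len(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    long len = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, m, 0) == 0)
        len = (long) (m[0].rm_eo - m[0].rm_so);
    regfree(&re);
    return len;
}
```

For a URL wrapped in double quotes, the "not blank" pattern also
swallows the closing quote, while the explicit pattern stops at the
last permitted character.  The explicit pattern likewise stops at a %
that is not followed by two hex digits.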

If the "not blank" character class is retained, some extra characters
will need adding to the negated class to make a valid match more
reliable, e.g. the double quote '"'.  Consider
"http://some.host/some/path", where the closing quote would otherwise
be matched as part of the URL.

| If the user gets a mail with a strange URL in it, the browser might
| fail, and there is some manual intervention needed.

With the % escapes and the permitted characters explicitly listed,
the chances of matching the valid portion of the URL are surely
improved.  If the external program can't cope, it's not our problem.

| On the other hand, creating bullet-proof regex's for both http and ftp

Agreed

| is more complicated, as the syntax differs a little bit (ftp allows a
| login string, e.g. ftp://user:secret@some.host.com:42/some/file, see
| RFC 1738, http

RFC 1738 has been updated by several newer RFCs.  RFC 2396 now
describes the generic syntax.

| doesn't). So I think we have two options:

Not quite.  From RFC 2396:
   This document defines a grammar that is a superset of all valid URI,
   such that an implementation can parse the common components of a URI
   reference without knowing the scheme-specific requirements of every
   possible identifier type.  This document does not define a generative
   grammar for URI; that task will be performed by the individual
   specifications of each URI scheme.

Since Balsa doesn't interpret the URLs, it only needs to match a
generic URI and identify the protocol to find a program to handle it.

Incidentally, Netscape will accept the user:secret@ portion in an
HTTP URL and use the information to perform HTTP authentication.
However the URL in the HTTP request does not include this information.

| * keep the current solution and rest in peace or

What's wrong with the more specific character class rather than
"not blanks"?  If the pattern matches printing characters that
are not valid without %-quoting, this is more likely to cause
problems for the external program.

| * make one *separate* regex for each of the following: https?, ftp,
| mailto, nntp, news, telnet.

No, that's overkill. Make one RE to match generic URI syntax.

| I am currently working on a solution to make all of this list
| clickable, and what I read from you makes me believe that the second
| solution is the better one. What do you think about that?

Make the URI scheme portion of the RE ([[:alpha:]][-+.[:alnum:]]*)://
and the trailing portion the one I gave before.  The scheme substring
is then available for matching with the correct external program.
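
A sketch of how the scheme substring could be pulled out with
regexec() subexpression offsets.  url_scheme() is a hypothetical
helper, not Balsa code.  The grouping is written with plain ( ) because
the pattern is compiled with REG_EXTENDED, and the trailing portion is
assumed to be the explicit character class given earlier with its
bracket expression closed.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: find the first generic "scheme://..." URI in
 * `text` and copy its scheme (subexpression 1) into `scheme`.
 * Returns 0 on success, -1 if nothing matches, the pattern fails to
 * compile, or the buffer is too small. */
static int url_scheme(const char *text, char *scheme, size_t size)
{
    static const char *uri_str =
        "([[:alpha:]][-+.[:alnum:]]*)://"
        "(%[[:digit:]A-Fa-f][[:digit:]A-Fa-f]|[-_.!~*';/?:@&=+$,[:alnum:]])+";
    regex_t re;
    regmatch_t m[2];    /* m[0]: whole match, m[1]: scheme */
    size_t len;
    int rc;

    if (size == 0 || regcomp(&re, uri_str, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 2, m, 0);
    regfree(&re);
    if (rc != 0)
        return -1;

    len = (size_t) (m[1].rm_eo - m[1].rm_so);
    if (len >= size)
        return -1;
    memcpy(scheme, text + m[1].rm_so, len);
    scheme[len] = '\0';
    return 0;
}
```

The caller can then compare the returned scheme ("http", "ftp", ...)
against whatever table maps schemes to external programs.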

Have a look at the RE from appendix B in RFC 2396.
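
For reference, the appendix B expression can be compiled as a POSIX
extended RE as it stands.  The sketch below is only an illustration:
uri_component() and RFC2396_RE are made-up names, and the group
numbers follow the RFC's own description (2 = scheme, 4 = authority,
5 = path, 7 = query, 9 = fragment).

```c
#include <regex.h>
#include <string.h>

/* The regular expression from RFC 2396, appendix B. */
static const char *RFC2396_RE =
    "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?";

/* Hypothetical helper: copy subexpression `group` (0..9) of the
 * appendix B match on `uri` into `out`.  Returns 0 on success, -1 if
 * the component is absent, the arguments are invalid, or `out` is
 * too small. */
static int uri_component(const char *uri, int group, char *out, size_t size)
{
    regex_t re;
    regmatch_t m[10];
    size_t len;
    int rc;

    if (group < 0 || group > 9 || size == 0)
        return -1;
    if (regcomp(&re, RFC2396_RE, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, uri, 10, m, 0);
    regfree(&re);
    if (rc != 0 || m[group].rm_so == -1)
        return -1;          /* no match, or component not present */

    len = (size_t) (m[group].rm_eo - m[group].rm_so);
    if (len >= size)
        return -1;
    memcpy(out, uri + m[group].rm_so, len);
    out[len] = '\0';
    return 0;
}
```

Note that this expression splits a string already known to be a URI
reference; it is anchored with ^ and is not meant for finding URIs in
running text.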


Regards
Brian Stafford
