Re: The URL regex.
- From: Brian Stafford <brian stafford uklinux net>
- To: Pawel Salek <pawsa TheoChem kth se>
- Cc: Carlos Morgado <chbm chbm nu>, balsa-list gnome org, albrecht dress arcormail de
- Subject: Re: The URL regex.
- Date: Mon, 21 May 2001 10:14:45 +0100
On Mon, 21 May 09:44 Pawel Salek wrote:
| Hi,
|
| I have noticed the url regex expression
| const char *url_str = "((ht|f)tps?://[^[:blank:]\r\n]+)";
| has been replaced by
| const char *url_str = "\\<((ht|f)tp[s]?://[^[:blank:]]+)\\>";
|
| IMO, the <>-brackets around the URL are rare and I think including them
| in the regex is not needed. Most often, the URLs are just quoted in the
| text and it is better to be able to click on them. (BTW, \r\n characters
| are included in [:blank:] and can be removed). What do you think?
I agree with the comments regarding the <>.
I would also suggest that "(ht|f)tp[s]?" be rewritten for clarity,
e.g. as "(http|ftp)s?". Also, the [^[:blank:]]+ pattern doesn't exclude
characters like "()<>" etc. that would normally be %-quoted in a
URL, so it's likely that the pattern picks up some trailing garbage.
Taking the legal characters from RFC 2396, I suggest the following
for the trailing portion of the pattern:
(%[[:digit:]A-Fa-f][[:digit:]A-Fa-f]|[-_.!~*'();/?:@&=+$,[:alnum:]])+
This pattern includes the parenthesis characters, but these could be
problematic, so it might be best to omit them.
The complete RE, omitting the () characters, would be:
(http|ftp)s?://(%[[:digit:]A-Fa-f][[:digit:]A-Fa-f]|[-_.!~*';/?:@&=+$,[:alnum:]])+
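To make the difference concrete, here is a minimal sketch using the POSIX
regcomp()/regexec() interface. It assumes the bracket expression is closed
with "]]" as the POSIX syntax requires, and the names OLD_URL_RE, NEW_URL_RE
and first_match() are made up for this illustration; they are not taken from
the Balsa sources.

```c
#include <regex.h>
#include <stddef.h>

/* Earlier style of tail: everything up to the next space or tab. */
static const char *const OLD_URL_RE = "(http|ftp)s?://[^[:blank:]]+";

/* Suggested tail: %XX escapes or the RFC 2396 characters, minus "()". */
static const char *const NEW_URL_RE =
    "(http|ftp)s?://"
    "(%[[:digit:]A-Fa-f][[:digit:]A-Fa-f]|[-_.!~*';/?:@&=+$,[:alnum:]])+";

/*
 * Hypothetical helper: find the first match of `pattern` in `text`.
 * Returns 1 and fills *start / *len on a match, 0 if nothing matched
 * (or regexec failed), and -1 if the pattern did not compile.
 */
int first_match(const char *pattern, const char *text,
                size_t *start, size_t *len)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;

    *start = (size_t)m[0].rm_so;
    *len = (size_t)(m[0].rm_eo - m[0].rm_so);
    return 1;
}
```

For the text "(http://example.com/page)", OLD_URL_RE should match 24
characters, including the closing ")", while NEW_URL_RE should stop after
"page" (23 characters). Note that "." and "," are still legal URL characters,
so sentence punctuation directly after a URL would still be picked up.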
Regards
Brian Stafford