Re: The URL regex.



On Mon, 21 May 09:44 Pawel Salek wrote:
| Hi,
| 
| I have noticed the url regex expression
| const char *url_str = "((ht|f)tps?://[^[:blank:]\r\n]+)";
| has been replaced by
| const char *url_str = "\\<((ht|f)tp[s]?://[^[:blank:]]+)\\>";     
| 
| IMO, the <>-brackets around the URL are rare and I think including them
| in the regex is not needed. Most often, the URLs are just quoted in the
| text and it is better to be able to click on them. (BTW, \r\n characters
| are included in [:blank:] and can be removed). What do you think?

I agree with the comments regarding the <>.

I would also suggest that "(ht|f)tp[s]?" be rewritten for clarity,
e.g. as "(http|ftp)s?".  Also, the [^[:blank:]]+ pattern doesn't exclude
characters like "()<>" that would normally be %-quoted in a
URL, so it's likely that the pattern picks up some trailing garbage.
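
To illustrate the trailing garbage, here is a minimal sketch assuming the
POSIX regcomp()/regexec() interface.  The helper name match_len() is made
up for this example and is not taken from the existing code.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: length of the first match of `pattern` in `text`,
 * or -1 if nothing matches or the pattern does not compile. */
static long match_len(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    long len = -1;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, m, 0) == 0)
        len = (long) (m[0].rm_eo - m[0].rm_so);
    regfree(&re);
    return len;
}
```

With the pattern "(http|ftp)s?://[^[:blank:]]+" and the text
"(see http://example.com/x), ok", the match does not stop at the end of
the URL but runs on through the ")," that follows it.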

Taking the legal characters from RFC 2396, I suggest the following
for the trailing portion of the pattern:

(%[[:digit:]A-Fa-f][[:digit:]A-Fa-f]|[-_.!~*'();/?:@&=+$,[:alnum:]])+

This pattern includes the parenthesis characters, but these could be
problematic, so it might be best to omit them.

The complete RE, omitting the () characters, would be:

(http|ftp)s?://(%[[:digit:]A-Fa-f][[:digit:]A-Fa-f]|[-_.!~*';/?:@&=+$,[:alnum:]])+
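
As a rough sketch of how the complete RE could be tried out, assuming POSIX
regcomp()/regexec() with REG_EXTENDED: the helper find_url() below is a
hypothetical name, and the bracket expression is closed with a second "]"
after [:alnum:] so that regcomp() accepts the pattern.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* The complete RE without the () characters, split over two string
 * literals for readability. */
static const char *url_str =
    "(http|ftp)s?://"
    "(%[[:digit:]A-Fa-f][[:digit:]A-Fa-f]|[-_.!~*';/?:@&=+$,[:alnum:]])+";

/* Hypothetical helper: locate the first URL in `text`.
 * Returns 0 and fills in *start and *len on a match, 1 if there is no
 * match, and -1 if the pattern fails to compile. */
static int find_url(const char *text, size_t *start, size_t *len)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(text != NULL && start != NULL && len != NULL);

    if (regcomp(&re, url_str, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 1, m, 0);
    if (rc == 0) {
        *start = (size_t) m[0].rm_so;
        *len = (size_t) (m[0].rm_eo - m[0].rm_so);
    }
    regfree(&re);
    return rc == 0 ? 0 : 1;
}
```

For the text "see <http://example.com/a?b=1> now" the match covers only
http://example.com/a?b=1 and leaves the closing ">" out.  Note that "." and
"," are still legal URL characters, so sentence punctuation directly after
a URL would still be picked up.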

Regards
Brian Stafford



