Re: The URL regex.

On Mon, 21 May 09:44 Pawel Salek wrote:
| Hi,
| I have noticed that the URL regular expression
| const char *url_str = "((ht|f)tps?://[^[:blank:]\r\n]+)";
| has been replaced by
| const char *url_str = "\\<((ht|f)tp[s]?://[^[:blank:]]+)\\>";     
| IMO, the <>-brackets around the URL are rare and I think including them
| in the regex is not needed. Most often, the URLs are just quoted in the
| text and it is better to be able to click on them. (BTW, \r\n characters
| are included in [:blank:] and can be removed). What do you think?

I agree with the comments regarding the <>.

I would also suggest that "(ht|f)tp[s]?" be rewritten for clarity,
e.g. as "(http|ftp)s?".  Also, the [^[:blank:]]+ pattern doesn't exclude
characters such as "()<>" that would normally be %-escaped in a
URL, so it's likely that the pattern picks up some trailing garbage.
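
As a rough illustration of the trailing-garbage problem, here is a
minimal sketch assuming POSIX regcomp()/regexec() with REG_EXTENDED.
The helper name first_match and the sample text are hypothetical and
are not taken from the application's source.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the first match of `pattern` in `text`
 * into `out` (truncating if needed). Returns 1 on a match, else 0. */
int first_match(const char *pattern, const char *text,
                char *out, size_t outlen)
{
    regex_t re;
    regmatch_t m[1];
    int found = 0;

    if (outlen == 0 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;

    if (regexec(&re, text, 1, m, 0) == 0) {
        size_t len = (size_t)(m[0].rm_eo - m[0].rm_so);
        if (len >= outlen)
            len = outlen - 1;          /* leave room for the NUL */
        memcpy(out, text + m[0].rm_so, len);
        out[len] = '\0';
        found = 1;
    }

    regfree(&re);
    return found;
}
```

Calling first_match("(http|ftp)s?://[^[:blank:]]+",
"(see http://example.com/page) now", buf, sizeof buf) leaves
"http://example.com/page)" in buf: the closing parenthesis is swept
into the match because only blanks end it.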

Taking the legal characters from RFC 2396, I suggest the following
for the trailing portion of the pattern


This pattern includes the parenthesis characters, but these could be
problematic, so it might be best to omit them.

The complete RE, omitting the () characters, would be
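
One way such an RE could be assembled is sketched below. This is an
illustration only, not necessarily the exact pattern proposed in this
message: it assumes POSIX extended syntax, takes the unreserved and
reserved characters of RFC 2396 plus "%" for escapes, and leaves out
"(" and ")". The names url_sketch and find_url are hypothetical.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Illustrative pattern (an assumption, not the original): scheme,
 * "://", then one or more RFC 2396 unreserved/reserved characters or
 * '%', with '(' and ')' omitted. The leading '-' inside the bracket
 * expression is a literal hyphen. */
static const char url_sketch[] =
    "(http|ftp)s?://[-[:alnum:]_.!~*';/?:@&=+$,%]+";

/* Hypothetical helper: copy the first URL found in `text` into `out`
 * (truncating if needed). Returns 1 if a URL was found, else 0. */
int find_url(const char *text, char *out, size_t outlen)
{
    regex_t re;
    regmatch_t m[1];
    int found = 0;

    if (outlen == 0 || regcomp(&re, url_sketch, REG_EXTENDED) != 0)
        return 0;

    if (regexec(&re, text, 1, m, 0) == 0) {
        size_t len = (size_t)(m[0].rm_eo - m[0].rm_so);
        if (len >= outlen)
            len = outlen - 1;          /* leave room for the NUL */
        memcpy(out, text + m[0].rm_so, len);
        out[len] = '\0';
        found = 1;
    }

    regfree(&re);
    return found;
}
```

With this sketch, find_url("(see http://example.com/page?x=1) now",
buf, sizeof buf) stops at the closing parenthesis and leaves
"http://example.com/page?x=1" in buf. Because the character set is
listed explicitly, \r and \n also end the match.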


Brian Stafford
