10

I have this regex:

 re.compile(r"((https?):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)", re.MULTILINE|re.UNICODE)

But that doesn't match hashbangs (#!). What do I need to change to get it working? I know I can add ! to the character class alongside #@% etc., but then it would also match something like

Check this out: http://example.com/something/!!!

and I want to avoid that.

Kirill Polishchuk
ThomK
  • 2
    How about checking out the RFC for URI syntax (http://www.ietf.org/rfc/rfc3986.txt)? It will show you that the bang can only be used in certain ways otherwise it has to be escaped. Good question. – Ray Toal Jul 16 '11 at 16:20
  • 1
    I hope you're not trying to use this regex to match URLs requested by a browser: if so, you should realise that the part after the hash is *not* sent in a normal client request. – Daniel Roseman Jul 16 '11 at 17:31
  • No. I'm parsing user input and making links shorter and safer for users (we have full control; we can block a link, a domain, etc.). And with the original regex there was http://ourshortdomain.foo/urlhash/#!/twitter/something ;) – ThomK Jul 17 '11 at 19:12

6 Answers

20

Don't try to write your own regular expression for matching URLs; use one from someone who has already solved these problems, like this one.

kindall
  • 32
    While there's nothing wrong with using somebody else's code, there's nothing wrong with writing your own either! :) I think if everybody followed the recommendation _"Don't try to make your own , use someone else's"_ we would still all be living in caves! ;) – mac Jul 16 '11 at 16:41
  • 2
    @mac - If everyone had to reinvent everything, we'd make progress much more slowly. Far better to use someone else's completed idea and then make it better by improving it or adding something new to it. Even Newton acknowledged that he was building on the foundation of others' work. – unpythonic Jul 16 '11 at 17:29
  • 1
    @Mark - I surely don't argue with that and I never said that _everybody_ should reinvent the wheel! :) I just hold that there is no hard rule to follow: sometimes it makes sense to use others' work, sometimes it doesn't. – mac Jul 16 '11 at 17:36
  • 1
    @mac - You're absolutely right. However, we should gently nudge those who write horrific regular expressions into copying others' work until they gain enough knowledge so as to not leave a nightmare of others to maintain. :) :) – unpythonic Jul 16 '11 at 17:43
  • The method in this link doesn't match some valid urls, specifically url shorteners. I'd put an example, but SO doesn't let me put shortened urls. But specifically, it doesn't work with Twitter's shortener, `'https:// + 't.co' + '/blah'`. – dboshardy Jul 05 '17 at 20:34
  • The updated one in Gruber's gist works with the t.co URLs. I'll update the link. – kindall Jul 06 '17 at 00:00
  • @kindall It turns out I made a goof. PyCharm for whatever reason, when you let it fix lines that are too long, and your line is a triple-quoted string, it inserts a space at the end of your string... – dboshardy Jul 10 '17 at 14:06
  • 4
    The regex in the link is terrible: it attempts to list year 2011 known Top Level Domains and becomes VERY quickly OBSOLETE. – Cœur Mar 21 '18 at 10:25
7

This is a common problem; use the standard library.

For Python, use urlparse (urllib.parse in Python 3).
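
As a minimal sketch of what that looks like (assuming Python 3, where the module is urllib.parse; the sample URL is the one mentioned in the question's comments), the hashbang ends up in the fragment component:

```python
from urllib.parse import urlparse

# Sample URL from the discussion; the part after '#' becomes the fragment.
url = "http://ourshortdomain.foo/urlhash/#!/twitter/something"
parts = urlparse(url)

print(parts.scheme)    # http
print(parts.netloc)    # ourshortdomain.foo
print(parts.path)      # /urlhash/
print(parts.fragment)  # !/twitter/something
```

Note that urlparse splits a URL you already have into components; it does not locate URLs inside free text.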

Alireza Mazochi
estani
  • urlparse would still parse OP's problem URL: urlparse.urlparse('http://example.com/something/!!!') – hoju Jan 09 '14 at 20:56
  • Well, that's a valid URL, so first of all use a URL parser to get the info. Then you can decide what to do with it. I doubt a semantic parser is really what he wants; it is far simpler to try the URL out. If it doesn't work, strip the last characters and try again... – estani Jan 31 '14 at 14:31
7

It could be very long, but in practice mine works pretty well. Please try this one: ((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*

It matches all of the examples below:

http://wwww.stackoverflow.com
abc.com
http://test.test-75.1474.stackoverflow.com/
stackoverflow.com/
stackoverflow.com
rfordyce@broadviewnet.com
http://www.example.com/etcetc
www.example.com/etcetc
example.com/etcetc
user:pass@example.com/etcetc
(www.itmag.com)
example.com/etcetc?query=aasd
example.com/etcetc?query=aasd&dest=asds
http://stackoverflow.com/questions/6427530/regular-expression-pattern-to-match-url-with
www/Christina.V.Scott@gmail.com
line.lundvoll.nilsen@telemed.no.
s.hossain@unsw.edu.au 
s.hossain@unsw.edu.au     
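
If you run this pattern from Python, keep in mind that re.findall returns one tuple of captured groups per match when the pattern contains groups, not the matched text. A minimal sketch that collects the full matches with re.finditer instead (the name URL_RE and the sample sentence are illustrative assumptions, not part of the pattern itself):

```python
import re

# The pattern from this answer, unchanged, written as a Python raw string.
URL_RE = re.compile(
    r"((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+"
    r"\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*"
)

text = "i opened https://google.com and http://speedtest.net and www.standford.edu"

# group(0) is always the whole match, however many capture groups exist.
matches = [m.group(0) for m in URL_RE.finditer(text)]
print(matches)
```

Converting the capture groups to non-capturing ones, (?:...), would make re.findall return the matched text as well.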
Asad
  • 2
    I tried your regex with my sample text `i opened https://google.com and http://speedtest.net and www.standford.edu` but I don't get proper result. This is how I get `[('https://', 'https', 'm', ''), ('http://', 'http', 't', ''), ('', '', 'u', '')]` – mockash Feb 07 '21 at 13:56
  • It depends on where you are trying it: whether you are using Python (which doesn't need the backslash \ chars), Java, or something else. Please try this one out here: https://regexr.com/ – Asad Feb 08 '21 at 13:19
  • Unfortunately this approach matches with some unexpected strings like ```matches if you have.any.point that not necessarily is.a.site```, you can paste on pythex.org to see – luisvenezian Mar 23 '21 at 19:08
  • This doesn't recognize strings like `https://google/` which might be used as valid URLs. Your regex requires a `.com` or `.net` at the end. – CaptAngryEyes Apr 17 '21 at 16:36
1

I'll admit that I'm a little bit worried about an application that requires a regex like that to match URLs. That said, this seems to work for me:

((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)
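
A quick way to check this against the two inputs from the question, as a sketch (the variable names are illustrative; the pattern is the one above, written as a Python raw string with the flags from the question):

```python
import re

URL_RE = re.compile(
    r"((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)",
    re.MULTILINE | re.UNICODE,
)

hashbang = "http://ourshortdomain.foo/urlhash/#!/twitter/something"
trailing = "Check this out: http://example.com/something/!!!"

# The hashbang URL is matched in full.
print(URL_RE.search(hashbang).group(0))
# The trailing exclamation marks are left out of the match.
print(URL_RE.search(trailing).group(0))
```

The pattern only accepts ! when it directly follows #, so a hashbang is kept while trailing exclamation marks stay outside the match.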
tsm
1

Based on this link, we can use the validators library.

For example:

import validators

# validators.url returns True for a valid URL and a falsy failure object otherwise
valid = validators.url('https://codespeedy.com/')
if valid:
    print("Url is valid")
else:
    print("Invalid url")
Alireza Mazochi
0

This is the most complete pattern I use:

URL_PATTERN = r'[A-Za-z0-9]+://[A-Za-z0-9%-_]+(/[A-Za-z0-9%-_])*(#|\\?)[A-Za-z0-9%-_&=]*'