How do you extract a url from a string using python?

Question

For example:

string = "This is a link http://www.google.com"

How could I extract 'http://www.google.com' ?

(Each link will be of the same format i.e 'http://')

You might check out this answer: http://stackoverflow.com/questions/499345/regular-expression-to-extract-url-from-an-html-link — rjz, Mar 18 '12 at 17:42
If this is for a raw text file (as expressed in your question), you might check this answer: http://stackoverflow.com/questions/839994/extracting-a-url-in-python — Alexandre Dulaunoy, Mar 18 '12 at 17:45
Possible duplicate of [What is the best regular expression to check if a string is a valid URL?](https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url) — Yash Kumar Verma, Sep 28 '17 at 09:49

score 40 · Accepted Answer · answered Mar 18 '12 at 17:48

40

There may be few ways to do this but the cleanest would be to use regex

>>> myString = "This is a link http://www.google.com"
>>> print re.search("(?P<url>https?://[^\s]+)", myString).group("url")
http://www.google.com

If there can be multiple links you can use something similar to below

>>> myString = "These are the links http://www.google.com  and http://stackoverflow.com/questions/839994/extracting-a-url-in-python"
>>> print re.findall(r'(https?://[^\s]+)', myString)
['http://www.google.com', 'http://stackoverflow.com/questions/839994/extracting-a-url-in-python']
>>>

answered Mar 18 '12 at 17:48

Abhijit

59,056
16
119
195

11

This is too crude for many real-world scenarios. It fails entirely for `ftp://` URLs and `mailto:` URLs etc, and will naïvely grab the tail part from `Click here` (i.e. up through "click"). – tripleee Oct 10 '14 at 10:39
3

@tripleee The question isn't about parsing HTML, but finding a URL in a string of text that will always be `http` format. So this works really well for that. But yes, pretty important for people to know what you're saying if they're here for parsing HTML or similar. – teewuane Nov 16 '16 at 17:42

score 26 · Answer 2 · edited Jan 03 '18 at 12:50

In order to find a web URL in a generic string, you can use a regular expression (regex).

A simple regex for URL matching like the following should fit your case.

    regex = r'('

    # Scheme (HTTP, HTTPS, FTP and SFTP):
    regex += r'(?:(https?|s?ftp):\/\/)?'

    # www:
    regex += r'(?:www\.)?'

    regex += r'('

    # Host and domain (including ccSLD):
    regex += r'(?:(?:[A-Z0-9][A-Z0-9-]{0,61}[A-Z0-9]\.)+)'

    # TLD:
    regex += r'([A-Z]{2,6})'

    # IP Address:
    regex += r'|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'

    regex += r')'

    # Port:
    regex += r'(?::(\d{1,5}))?'

    # Query path:
    regex += r'(?:(\/\S+)*)'

    regex += r')'

If you want to be even more precise, in the TLD section, you should ensure that the TLD is a valid TLD (see the entire list of valid TLDs here: https://data.iana.org/TLD/tlds-alpha-by-domain.txt):

    # TLD:
    regex += r'(com|net|org|eu|...)'

Then, you can simply compile the former regex and use it to find possible matches:

    import re

    string = "This is a link http://www.google.com"

    find_urls_in_string = re.compile(regex, re.IGNORECASE)
    url = find_urls_in_string.search(string)

    if url is not None and url.group(0) is not None:
        print("URL parts: " + str(url.groups()))
        print("URL" + url.group(0).strip())

Which, in case of the string "This is a link http://www.google.com" will output:

    URL parts: ('http://www.google.com', 'http', 'google.com', 'com', None, None)
    URL: http://www.google.com

If you change the input with a more complex URL, for example "This is also a URL https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo but this is not anymore" the output will be:

    URL parts: ('https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo', 'https', 'host.domain.com', 'com', '80', '/path/page.php?query=value&a2=v2#foo')
    URL: https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo

NOTE: If you are looking for more URLs in a single string, you can still use the same regex, but just use findall() instead of search().

So, the regex end up being `((?:(https?|s?ftp):\/\/)?(?:www\.)?((?:(?:[A-Z0-9][A-Z0-9-]{0,61}[A-Z0-9]\.)+)([A-Z]{2,6})|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))(?::(\d{1,5}))?(?:(\/\S+)*))`. Also note the [TLD list](https://data.iana.org/TLD/tlds-alpha-by-domain.txt) right now also includes fun endings like `XN--VERMGENSBERATUNG-PWB`, being 24 characters long, which will not be catched by this. — luckydonald, Sep 21 '16 at 13:13
Would be better to add `(?i)` to the pattern - more portable. Also, bear in mind this will match `23.084.828.566` which is not a valid IP address but is a valid float in some locales. — Mr_and_Mrs_D, Feb 28 '18 at 22:39
There's some sort of length limit to this regex e.g: `docs.google.com/spreadsheets/d/10FmR8upvxZcZE1q9n1o40z16mygUJklkXQ7lwGS4nlI` just matches `docs.google.com/spreadsheets/d/10FmR8upvxZcZE1q9n`. — Jorge Orpinel Pérez, Oct 25 '18 at 18:18

score 21 · Answer 3 · edited Jul 06 '17 at 00:21

21

There is another way how to extract URLs from text easily. You can use urlextract to do it for you, just install it via pip:

pip install urlextract

and then you can use it like this:

from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls("Let's have URL stackoverflow.com as an example.")
print(urls) # prints: ['stackoverflow.com']

You can find more info on my github page: https://github.com/lipoja/URLExtract

NOTE: It downloads a list of TLDs from iana.org to keep you up to date. But if the program does not have internet access then it's not for you.

edited Jul 06 '17 at 00:21

Jonathan Leffler

698,132
130
858
1,229

answered Feb 15 '17 at 16:40

1

Works like a charm, and doesn't clutter the rest of my script. – Henrik Aug 30 '20 at 12:58
Unfortunately, this fails whenever there is text (i.e., not space) attached to the beginning or end of the url. e.g. `ok/https://www.duckduckgo.com` won't catch the url in it. – autonopy Jul 30 '21 at 01:59

score 7 · Answer 4 · answered May 28 '18 at 14:13

This extracts all urls with parameters, somehow all above examples haven't worked for me

import re

data = 'https://net2333.us3.list-some.com/subscribe/confirm?u=f3cca8a1ffdee924a6a413ae9&id=6c03fa85f8&e=6bbacccc5b'

WEB_URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
re.findall(WEB_URL_REGEX, text)

Comsavvy · Answer 5 · 2021-05-29T10:45:22.703

You can extract any URL from a string using the following patterns,

1.

>>> import re
>>> string = "This is a link http://www.google.com"
>>> pattern = r'[(http://)|\w]*?[\w]*\.[-/\w]*\.\w*[(/{1})]?[#-\./\w]*[(/{1,})]?'
>>> re.search(pattern, string)
http://www.google.com

>>> TWEET = ('New Pybites article: Module of the Week - Requests-cache '
         'for Repeated API Calls - http://pybit.es/requests-cache.html '
         '#python #APIs')
>>> re.search(pattern, TWEET)
http://pybit.es/requests-cache.html

>>> tweet = ('Pybites My Reading List | 12 Rules for Life - #books '
             'that expand the mind! '
             'http://pbreadinglist.herokuapp.com/books/'
             'TvEqDAAAQBAJ#.XVOriU5z2tA.twitter'
             ' #psychology #philosophy')
>>> re.findall(pattern, TWEET)
['http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter']

to take the above pattern to the next level, we can also detect hashtags including URL the following ways

2.

>>> pattern = r'[(http://)|\w]*?[\w]*\.[-/\w]*\.\w*[(/{1})]?[#-\./\w]*[(/{1,})]?|#[.\w]*'
>>> re.findall(pattern, tweet)
['#books', http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter', '#psychology', '#philosophy']

The above example for taking URL and hashtags can be shortened to

>>> pattern = r'((?:#|http)\S+)'
>>> re.findall(pattern, tweet)
['#books', http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter', '#psychology', '#philosophy']

The pattern below can matches two alphanumeric separated by "." as URL

>>> pattern = pattern =  r'(?:http://)?\w+\.\S*[^.\s]'

>>> tweet = ('PyBites My Reading List | 12 Rules for Life - #books '
             'that expand the mind! '
             'www.google.com/telephone/wire....  '
             'http://pbreadinglist.herokuapp.com/books/'
             'TvEqDAAAQBAJ#.XVOriU5z2tA.twitter '
             "http://-www.pip.org "
             "google.com "
             "twitter.com "
             "facebook.com"
             ' #psychology #philosophy')
>>> re.findall(pattern, tweet)
['www.google.com/telephone/wire', 'http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter', 'www.pip.org', 'google.com', 'twitter.com', 'facebook.com']

You can try any complicated URL with the number 1 & 2 pattern. To learn more about re module in python, do check this out REGEXES IN PYTHON by Real Python.

Cheers!

score 0 · Answer 6 · answered Feb 01 '22 at 20:55

I've used a slight variation from @Abhijit's accepted answer.

This one uses \S instead of [^\s], which is equivalent but more concise. It also doesn't use a named group, because there is just one and we can ommit the name for simplicity reasons:

import re

my_string = "This is my tweet check it out http://example.com/blah"
print(re.search(r'(https?://\S+)', my_string).group())

Of course, if there are multiple links to extract, just use .findall():

print(re.findall(r'(https?://\S+)', my_string))

How do you extract a url from a string using python?

6 Answers6

Cheers!

Linked

Related