How to replace all URLs in a string with their hostname and tld (e.g. google.com)

Question

I have a string with multiple URLs and some text in between.

How can I replace each URL with its hostname and top-level domain?

Example Input: www.google.com some text google.com some text http://google.com some text https://stackoverflow.com/questions/ask

Desired Output: google.com some text google.com some text google.com some text stackoverflow.com

I've found the Python module tldextract, but it only helps with extracting the hostname + TLD from a single URL, not with finding and replacing all URLs in a string.

Thanks in advance!

@JammyDodger I don't know how, since the URLs are all different... I would have to do that dynamically — Tom, Aug 03 '19 at 20:54
Possible duplicate of [Get protocol + host name from URL](https://stackoverflow.com/questions/9626535/get-protocol-host-name-from-url) — clubby789, Aug 03 '19 at 20:56
@tom you can use urlparse lib to extract top level domain, for each url in your string. — Luiz Lai, Aug 03 '19 at 20:56

calestini · Answer 1 · 2019-08-03T21:35:00.497

You can also use a regex built from three alternatives:

(http[s]?://) --> matches http:// or https://
(www\.) --> matches www.
(?<=\.com)(/[^ ]*) --> matches the path after .com (a slash and everything up to the next space) without consuming .com itself. To cover other three-letter suffixes such as org or net, use (?<=\.[a-z][a-z][a-z]) as the lookbehind instead.

import re

yourString = 'www.google.com some text google.com some text http://google.com some text https://stackoverflow.com/questions/ask'

re.sub(r'(http[s]?://)|(?<=\.com)(/[^ ]*)|(www\.)', '', yourString)

Out[1]: 'google.com some text google.com some text google.com some text stackoverflow.com'
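As a rough sketch of the generalized lookbehind described above, the pattern can be compiled once and wrapped in a function. The names `URL_NOISE` and `strip_url_noise` are made up for illustration, and the extra `python.org` URL in the sample is an added test case, not part of the question. It assumes lowercase three-letter suffixes only, so suffixes like `.io`, `.info` or `.co.uk` would keep their paths.

```python
import re

# Hypothetical names; the three-letter-suffix rule is an assumption
# carried over from the explanation above.
URL_NOISE = re.compile(
    r'https?://'                 # scheme
    r'|www\.'                    # leading "www."
    r'|(?<=\.[a-z]{3})/[^ ]*'    # path right after a three-letter suffix
)

def strip_url_noise(text):
    """Remove scheme, 'www.' and path from URLs with a three-letter suffix."""
    return URL_NOISE.sub('', text)

sample = (
    'www.google.com some text google.com some text '
    'http://google.com some text https://stackoverflow.com/questions/ask '
    'https://www.python.org/downloads/'
)
result = strip_url_noise(sample)
print(result)
```

Because the pattern only deletes pieces of the string, any `www.` or `http://` that appears outside a URL is removed as well.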

score 0 · Accepted Answer · answered Aug 03 '19 at 20:58

You could simply strip prefixes such as 'www.' or 'http://' from the front of each URL, but that leaves whatever follows the suffix (the path, query string and so on), which can't be predicted.

Try this:

import tldextract

somestr = 'www.google.com some text google.com some text http://google.com some text https://stackoverflow.com/questions/ask'

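# Collect the rewritten words in a list and join them at the end,
# which avoids the trailing space left by appending word + ' '.
words = []

for word in somestr.split(' '):
    extracted = tldextract.extract(word)
    if extracted.domain != '' and extracted.suffix != '':
        # The word parsed as a URL: keep only domain + suffix.
        words.append(extracted.domain + '.' + extracted.suffix)
    else:
        # Not a URL: keep the word unchanged.
        words.append(word)

newstr = ' '.join(words)

print(newstr)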
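The word-by-word loop above can also be written with re.sub and a callback, which keeps tabs and newlines intact instead of splitting on single spaces. The sketch below swaps in the standard library's urllib.parse for tldextract so that it runs without extra packages. As a result it returns the hostname minus a leading 'www.', not the registered domain. `hostname_without_www` and `replace_urls` are hypothetical names, and the "looks like a URL" test is a crude assumption.

```python
import re
from urllib.parse import urlsplit

def hostname_without_www(token):
    """Return a shortened host for URL-like tokens, or None otherwise.

    Stdlib stand-in for tldextract: yields the hostname without a
    leading 'www.', not the registered domain.
    """
    # urlsplit only fills in the host when a scheme or '//' is present.
    candidate = token if '://' in token else '//' + token
    try:
        host = urlsplit(candidate).hostname
    except ValueError:
        return None
    if not host:
        return None
    # Crude URL test (an assumption): at least two labels, and the last
    # one is alphabetic with two or more letters.
    labels = host.rstrip('.').split('.')
    if len(labels) < 2 or not (labels[-1].isalpha() and len(labels[-1]) >= 2):
        return None
    return host[4:] if host.startswith('www.') else host

def replace_urls(text, shorten=hostname_without_www):
    """Rewrite each whitespace-delimited token that `shorten` recognises."""
    def repl(match):
        short = shorten(match.group(0))
        return short if short is not None else match.group(0)
    return re.sub(r'\S+', repl, text)

print(replace_urls(
    'www.google.com some text google.com some text '
    'http://google.com some text https://stackoverflow.com/questions/ask'
))
```

To get the registered domain instead, a function built on tldextract.extract, as in the loop above, could be passed as `shorten`. Tokens with trailing punctuation (for example `google.com,`) are left unchanged by this heuristic, and email addresses would be reduced to their domain.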
