0

I have a string with multiple URLs and some text in between.

How can I replace each URL with its hostname and top-level domain?

Example Input: www.google.com some text google.com some text http://google.com some text https://stackoverflow.com/questions/ask

Desired Output: google.com some text google.com some text google.com some text stackoverflow.com

I've found the Python module tldextract, but it only extracts the hostname and TLD from a single URL; it doesn't find and replace all the URLs in a string.

Thanks in advance!

Tom

2 Answers

1

You can also use a regex that removes three things:

  1. (http[s]?://) --> Capture http:// or https://
  2. (www\.) --> Capture www.
  3. (?<=\.com)(/[^ ]*) --> Capture the path after .com (a slash and everything up to the next space) without capturing .com itself. To cover other three-letter suffixes such as org or net, the lookbehind can be written as (?<=\.[a-z][a-z][a-z]).

import re

yourString = 'www.google.com some text google.com some text http://google.com some text https://stackoverflow.com/questions/ask'

re.sub(r'(http[s]?://)|(?<=\.com)(/[^ ]*)|(www\.)', '', yourString)

Out[1]: 'google.com some text google.com some text google.com some text stackoverflow.com'
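The pattern above is tied to .com (or to three-letter suffixes). As a rough sketch of how the same idea could be generalized, the version below matches the whole URL in one go and keeps only the hostname via a capture group. The names URL_RE and strip_urls are made up for this example, and the notion of "URL" here (optional scheme, optional www., a dotted name ending in two or more letters, optional path) is an assumption, not a full URL grammar. It will also keep subdomains such as mail.google.com as they are.

```python
import re

# Assumption: a "URL" is an optional scheme, an optional "www.",
# a dotted hostname ending in 2+ letters, and an optional path.
URL_RE = re.compile(
    r'(?:https?://)?'                 # scheme, if present
    r'(?:www\.)?'                     # leading "www.", if present
    r'((?:[a-z0-9-]+\.)+[a-z]{2,})'   # hostname (captured)
    r'(?:/\S*)?',                     # path/query up to the next whitespace
    re.IGNORECASE,
)

def strip_urls(text):
    """Replace every URL-like match with just its captured hostname."""
    return URL_RE.sub(r'\1', text)

your_string = ('www.google.com some text google.com some text '
               'http://google.com some text '
               'https://stackoverflow.com/questions/ask')
print(strip_urls(your_string))
```

Because the hostname pattern only needs a dot followed by letters, ordinary text like "end.Next" with a missing space would also be treated as a hostname, so this is only suitable for reasonably clean input.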
calestini
0

You could just strip prefixes such as 'www.' to handle the part before the domain, but that does nothing about whatever follows the suffix (paths, query strings), which can't be predicted.

Try this:

import tldextract

somestr = 'www.google.com some text google.com some text http://google.com some text https://stackoverflow.com/questions/ask'

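words = []

for word in somestr.split(' '):
    extracted = tldextract.extract(word)
    # Treat the word as a URL only if both a domain and a suffix were found
    if extracted.domain != '' and extracted.suffix != '':
        words.append(extracted.domain + '.' + extracted.suffix)
    else:
        words.append(word)

# Joining avoids the trailing space that appending word + ' ' would leave
newstr = ' '.join(words)

print(newstr)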
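If installing tldextract isn't an option, the same word-by-word idea can be roughly approximated with the standard library. This is only a sketch: host_of and replace_urls are hypothetical helper names, urllib.parse knows nothing about public suffixes (so sub.example.co.uk stays as it is, minus a leading www.), and the "looks like a URL" check is a simple assumption (a hostname containing a dot and ending in letters). Using re.sub on \S+ instead of split(' ') also keeps tabs and newlines intact.

```python
import re
from urllib.parse import urlsplit

def host_of(token):
    """Return the hostname without a leading 'www.', or None if the
    token does not look like a URL (assumed heuristic, see above)."""
    try:
        # urlsplit only fills in .hostname when a scheme or '//' is present,
        # so prepend '//' to bare tokens such as 'google.com/path'.
        host = urlsplit(token if '://' in token else '//' + token).hostname
    except ValueError:
        # e.g. an unbalanced '[' is rejected as an invalid IPv6 URL
        return None
    # Require a dot and an alphabetic last label ('3.14' and 'end.' fail)
    if not host or '.' not in host or not host.rpartition('.')[2].isalpha():
        return None
    return host[4:] if host.startswith('www.') else host

def replace_urls(text):
    # \S+ matches each run of non-whitespace, so original spacing is preserved
    return re.sub(r'\S+', lambda m: host_of(m.group()) or m.group(), text)

somestr = ('www.google.com some text google.com some text '
           'http://google.com some text '
           'https://stackoverflow.com/questions/ask')
print(replace_urls(somestr))
```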
aybry