
I am trying to extract the domain names out of a list of URLs, just like in https://stackoverflow.com/questions/18331948/extract-domain-name-from-the-url
My problem is that the URLs can be about anything. A few examples:
m.google.com => google
m.docs.google.com => google
www.someisotericdomain.innersite.mall.co.uk => mall
www.ouruniversity.department.mit.ac.us => mit
www.somestrangeurl.shops.relevantdomain.net => relevantdomain
www.example.info => example
And so on..
The diversity of the domains doesn't allow me to use a regex as shown in how to get domain name from URL (my script will be running on an enormous amount of URLs from real network traffic, so the regex would have to be enormous in order to catch all kinds of domains as mentioned).
Unfortunately, my web research didn't turn up any efficient solution.
Does anyone have an idea of how to do this?
Any help will be appreciated!
Thank you

Rafid Aslam
kobibo
  • Can you use an external lib? – Phung Duy Phong May 17 '17 at 10:10
  • Possible duplicate of [how to get domain name from URL](http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url) – guidot May 17 '17 at 10:15
  • Yes, I can use external libs. It is not a duplication (I even attached a link to this thread), I couldn't find a satisfying answer there. – kobibo May 17 '17 at 10:20
  • Use [**`urllib.parse`**](https://docs.python.org/3/library/urllib.parse.html) – Peter Wood May 17 '17 at 10:42
  • Gather a list of top-level domains, split your url by dots, right-strip your url from TLD, extract name. – Pearley May 17 '17 at 10:10
  • Does this answer your question? [Get protocol + host name from URL](https://stackoverflow.com/questions/9626535/get-protocol-host-name-from-url) – philshem Nov 17 '20 at 20:20
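
Pearley's suggestion above (gather a list of suffixes, split the host on dots, strip the suffix, take the remaining right-most label) can be sketched roughly like this. Note that the SUFFIXES set is a tiny hand-made stand-in for illustration only; a real implementation would load the full Public Suffix List instead:

```python
# Sketch of the approach from the comments: strip a known public
# suffix, then take the right-most remaining label.
# SUFFIXES is a tiny hand-made stand-in; a real implementation
# would load the full Public Suffix List.
SUFFIXES = {"com", "net", "info", "co.uk", "ac.us"}

def registered_name(host):
    labels = host.lower().split(".")
    # Longer suffixes are tried first ("co.uk" before "uk"),
    # because i walks from the left.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in SUFFIXES:
            return labels[i - 1] if i > 0 else ""
    return labels[-1]

print(registered_name("www.someisotericdomain.innersite.mall.co.uk"))  # mall
print(registered_name("m.docs.google.com"))  # google
```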

5 Answers

35

Use tldextract, which is a more efficient version of urlparse. tldextract accurately separates the gTLD or ccTLD (generic or country-code top-level domain) from the registered domain and subdomains of a URL.

>>> import tldextract
>>> ext = tldextract.extract('http://forums.news.cnn.com/')
>>> ext
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> ext.domain
'cnn'
akash karothiya
  • Note: the `tldextract` library makes an http request upon initial install and creates a cache of the latest tld data. This can raise a permission error for some remote deployments. See here: https://github.com/john-kurkowski/tldextract#note-about-caching – alphazwest Jul 30 '19 at 15:14
3

It seems you can use urlparse (https://docs.python.org/3/library/urllib.parse.html) on that URL, and then extract the netloc.

And from the netloc you can easily extract the domain name by using split.
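
A minimal sketch of that idea (assuming the URL carries a scheme, since urlparse only fills netloc when one is present):

```python
from urllib.parse import urlparse

# urlparse only populates netloc when the URL has a scheme
# (or at least a leading "//").
url = "http://m.docs.google.com/some/path"
netloc = urlparse(url).netloc
print(netloc)                 # m.docs.google.com

# A naive split gives the second-to-last label, which works for
# simple hosts but not for multi-part suffixes like "co.uk".
print(netloc.split(".")[-2])  # google
```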

Mariano Anaya
  • Thank you for your response. Unfortunately, using urlparse on a url like `m.city.domain.com` returned me `ParseResult(scheme='', netloc='', path='m.city.domain.com', params='', query='', fragment='')`, while the expected output was `domain` – kobibo May 17 '17 at 11:14
  • Use a valid URL (//m.city.domain.com/), not something like (m.city.domain.com). Nobody can guess what you passed when you removed the slashes. – Nairum Jul 24 '20 at 18:35
1

Simple solution via string splitting (despite appearances, no regex is needed):

def domain_name(url):
    return url.split("www.")[-1].split("//")[-1].split(".")[0]
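
For reference, this split only returns the expected name when the host starts with www. and ends in a single-label suffix; checking it against a few of the question's own examples:

```python
def domain_name(url):
    return url.split("www.")[-1].split("//")[-1].split(".")[0]

print(domain_name("www.example.info"))     # example
print(domain_name("http://m.google.com"))  # m, not "google"
print(domain_name("www.someisotericdomain.innersite.mall.co.uk"))  # someisotericdomain, not "mall"
```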
Sharif O
  • 43
  • 1
0

With regex, you could use something like this:

(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))

https://regex101.com/r/WQXFy6/5

Note that you'll have to watch out for special cases such as co.uk.
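
Applied with Python's re module, the pattern behaves like this (a sketch; the co.uk|ac.us alternation would need extending for every multi-part suffix you expect to see):

```python
import re

PATTERN = r"(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))"

for host in ("www.someisotericdomain.innersite.mall.co.uk",
             "www.ouruniversity.department.mit.ac.us",
             "m.docs.google.com"):
    match = re.search(PATTERN, host)
    print(match.group(1))  # mall, mit, google
```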

oddRaven
0

Check the replace and split methods.

PS: this only works for simple links like https://youtube.com (output: youtube) and www.user.ru.com (output: user)

def domain_name(url):
    return url.replace("www.","http://").split("//")[1].split(".")[0]
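
A quick check of the two cases the PS mentions, plus one showing how a full URL that also contains www. trips it up:

```python
def domain_name(url):
    return url.replace("www.", "http://").split("//")[1].split(".")[0]

print(domain_name("https://youtube.com"))      # youtube
print(domain_name("www.user.ru.com"))          # user
print(domain_name("http://www.google.com"))    # http:  (the replace injects a second scheme)
```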
Denis