
I want to get the content from this website.

If I use a browser like Firefox or Chrome I could get the real website page I want, but if I use the Python requests package (or wget command) to get it, it returns a totally different HTML page.

I thought the developer of the website had made some blocks for this.

Question

How do I fake a browser visit by using python requests or command wget?

Federico Baù
user1726366

8 Answers


Provide a User-Agent header:

import requests

url = 'http://www.ichangtou.com/#company:data_000008.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
print(response.content)
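If you are making several requests to the same site, a `requests.Session` lets you set the header once and reuse it (a small variation on the answer above, not part of the original; the cookie handling comes for free):

```python
import requests

# reuse one session so the User-Agent header (and any cookies the site
# sets) apply to every request made through it
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/39.0.2171.95 Safari/537.36'
})

# every request through this session now carries the browser User-Agent:
# response = session.get('http://www.ichangtou.com/#company:data_000008.html')
```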

FYI, here is a [list of User-Agent strings for different browsers](http://www.useragentstring.com/pages/useragentstring.php) (reported dead in the comments below).
As a side note, there is a pretty useful third-party package called fake-useragent that provides a nice abstraction layer over user agents:

fake-useragent

Up to date simple useragent faker with real world database

Demo:

>>> from fake_useragent import UserAgent
>>> ua = UserAgent()
>>> ua.chrome
u'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
>>> ua.random
u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'
wittrup
alecxe
  • Thanks for your answer. I tried with the headers in my requests but still could not get the real content of the page; there's a string 'Your web browser must have JavaScript enabled in order for this application to display correctly.' in the returned HTML page. Should I add JavaScript support to the requests? If so, how would I do that? – user1726366 Dec 26 '14 at 04:32
  • @user1726366: You can't simply add JavaScript support - you need a JavaScript interpreter for that. The simplest approach is to use the JavaScript interpreter of a real Web browser, but you can automate that from Python using [Selenium](https://pypi.python.org/pypi/selenium). – PM 2Ring Dec 26 '14 at 05:44
  • @alecxe, @sputnick: I tried to capture the packets with Wireshark to compare the difference between using Python requests and a browser. It seems the website URL isn't static and I have to wait for the page render to complete, so **Selenium** sounds like the right tool for me. Thank you for your kind help. :) – user1726366 Dec 26 '14 at 09:26
  • Turns out some search engines filter some `UserAgent`s. Anyone know why? Could anyone provide a list of acceptable `UserAgent`s? – dallonsi Aug 06 '20 at 12:18
  • This is the top User-Agent attacking us nowadays, I wonder why. – mveroone Jan 12 '21 at 16:35
  • The link to [List of all Browsers](http://www.useragentstring.com/pages/useragentstring.php) seems to be dead now. – Gino Mempin Feb 27 '21 at 04:19
  • Is this legal? What if I develop a mobile app that uses this in the backend and some website gets high enough traffic to cause problems? – mLstudent33 Aug 25 '21 at 23:44

I used the fake-useragent package.

How to use:

from fake_useragent import UserAgent
import requests

ua = UserAgent()
print(ua.chrome)
header = {'User-Agent': str(ua.chrome)}
print(header)
url = "https://www.hybrid-analysis.com/recent-submissions?filter=file&sort=^timestamp"
htmlContent = requests.get(url, headers=header)
print(htmlContent)

Output:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17
{'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
<Response [200]>
MendelG
Umesh Kaushik
  • Still getting Error 404. – Yaroslav Dukal May 13 '18 at 23:16
  • 404 is a different error; are you sure you are able to browse the page using a browser? – Umesh Kaushik May 15 '18 at 08:08
  • Absolutely. I feel like the website that I am trying to use blocked all Amazon EC2 IPs. – Yaroslav Dukal May 16 '18 at 08:58
  • Could you please paste the link here? I can try on my end. Further, if the IP is blocked then the error code should be 403 (Forbidden) or 401 (Unauthorized). There are websites which do not allow scraping at all, and many websites use Cloudflare to keep bots from accessing the site. – Umesh Kaushik May 17 '18 at 05:19
  • Here is my link http://regalbloodline.com/music/eminem. It worked fine before. It stopped working on Python 2, worked on Python 3 on my local machine, but after moving to AWS EC2 it did not work there - I kept getting Error 404. Then it stopped working on the local machine too. Using browser emulation worked on the local machine but not on EC2. In the end I gave up and found an alternative website to scrape. By the way, can Cloudflare be avoided? – Yaroslav Dukal May 17 '18 at 06:13
  • Hi, I checked the link. It could be that they have blocked access for proxies and other servers like AWS. I tried from my laptop (I live in India) and it worked fine; then I used a VPN which gives me a Singapore IP and it gave 404 in the browser as well. After that I tried changing country with the Hola VPN extension and tried 4-5 countries, but the same behaviour was observed. – Umesh Kaushik May 17 '18 at 06:28
  • Yeah, most likely. So you would recommend not to mess with it? They also block when I try to download the audio. – Yaroslav Dukal May 17 '18 at 06:33
  • Using code as well: without any VPN it works fine and returns status code 200, but with a VPN it gives a `404 Not Found` (nginx error page). – Umesh Kaushik May 17 '18 at 06:34
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/171209/discussion-between-umesh-kaushik-and-maksim-kniazev). – Umesh Kaushik May 17 '18 at 06:35

Try this, sending a fake user agent (it is also a good starter script for web scraping with cookies):

#!/usr/bin/env python2
# -*- coding: utf8 -*-
# vim:ts=4:sw=4

import cookielib, urllib2, sys

def doIt(uri):
    # keep cookies across requests
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    # set the fake User-Agent on the opener *before* opening the URL
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    page = opener.open(uri)
    print page.read()

for i in sys.argv[1:]:
    doIt(i)

USAGE:

python script.py "http://www.ichangtou.com/#company:data_000008.html"
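Note that the script above is Python 2 only (`cookielib` and `urllib2` were renamed in Python 3). A rough Python 3 equivalent of the same approach, as a sketch, would be:

```python
#!/usr/bin/env python3
import http.cookiejar
import sys
import urllib.request

def fetch(uri):
    # keep cookies across requests, as in the Python 2 version
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    # set the fake User-Agent before opening the URL
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    with opener.open(uri) as page:
        print(page.read().decode('utf-8', errors='replace'))

if __name__ == '__main__':
    for uri in sys.argv[1:]:
        fetch(uri)
```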
Gilles Quenot

The root of the answer is that the person asking the question needs a JavaScript interpreter to get what they are after. What I have found, though, is that I was often able to get all the information I wanted from a website as JSON, before it was interpreted by JavaScript. This has saved me a ton of time compared with parsing HTML and hoping each webpage is in the same format.

So when you get a response from a website using requests, really look at the HTML/text, because you might find the JSON that the JavaScript consumes embedded in the page, ready to be parsed.
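As a hypothetical sketch of that idea (the `<script>` tag id and JSON shape below are made up; inspect the actual page source to find the real ones):

```python
import json
import re

# pretend this is the HTML returned by requests.get(url).text;
# many pages embed their data as JSON inside a <script> tag
html = '''
<html><body>
<script id="initial-data" type="application/json">
{"company": {"name": "Example Co", "id": 8}}
</script>
</body></html>
'''

# pull the JSON out of the script tag and parse it directly,
# instead of scraping the rendered HTML
match = re.search(
    r'<script id="initial-data" type="application/json">(.*?)</script>',
    html, re.DOTALL)
data = json.loads(match.group(1))
print(data['company']['name'])  # -> Example Co
```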

Daniel Butler

Answer

You need to create a header with a properly formatted User-Agent string; it serves to identify the client to the server.

You can check your own user agent here.

Example

Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0
Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0

Third-party package: user_agent 0.1.9

I found this module very simple to use; in one line of code it randomly generates a User-Agent string.

from user_agent import generate_user_agent, generate_navigator, generate_navigator_js
from pprint import pprint

print(generate_user_agent())
# 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.3; Win64; x64)'

print(generate_user_agent(os=('mac', 'linux')))
# 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:36.0) Gecko/20100101 Firefox/36.0'

pprint(generate_navigator())

# {'app_code_name': 'Mozilla',
#  'app_name': 'Netscape',
#  'appversion': '5.0',
#  'name': 'firefox',
#  'os': 'linux',
#  'oscpu': 'Linux i686 on x86_64',
#  'platform': 'Linux i686 on x86_64',
#  'user_agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686 on x86_64; rv:41.0) Gecko/20100101 Firefox/41.0',
#  'version': '41.0'}

pprint(generate_navigator_js())

# {'appCodeName': 'Mozilla',
#  'appName': 'Netscape',
#  'appVersion': '38.0',
#  'platform': 'MacIntel',
#  'userAgent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:38.0) Gecko/20100101 Firefox/38.0'}
Federico Baù

I use pyuser_agent; this package fetches user agents:

import pyuser_agent
import requests

ua = pyuser_agent.UA()

headers = {
      "User-Agent" : ua.random
}
print(headers)

uri = "https://github.com/THAVASIGTI/"
res = requests.request("GET",uri,headers=headers)
print(res)

Console output:

{'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/533+ (KHTML, like Gecko)'}
<Response [200]>
THAVASI.T

I had a similar issue, but I was unable to use the `UserAgent` class from the `fake_useragent` module because I was running the code inside a Docker container.

import requests
import ujson
import random

# fetch the raw browser database that fake_useragent itself uses
response = requests.get('https://fake-useragent.herokuapp.com/browsers/0.1.11')
agents_dictionary = ujson.loads(response.text)
# note: randint's upper bound is inclusive, so subtract 1 to stay in range
random_browser_number = str(random.randint(0, len(agents_dictionary['randomize']) - 1))
random_browser = agents_dictionary['randomize'][random_browser_number]
user_agents_list = agents_dictionary['browsers'][random_browser]
user_agent = user_agents_list[random.randint(0, len(user_agents_list) - 1)]

I targeted the endpoint used by the module. This solution still gives me a random user agent; however, there is the possibility that the data structure at the endpoint could change.


This is how I have been using a random user agent from a list of nearly 1000 fake user agents:

from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem
software_names = [SoftwareName.ANDROID.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value, OperatingSystem.MAC.value]   

user_agent_rotator = UserAgent(software_names=software_names, operating_systems=operating_systems, limit=1000)

# Get list of user agents.
user_agents = user_agent_rotator.get_user_agents()

user_agent_random = user_agent_rotator.get_random_user_agent()

Example

print(user_agent_random)

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36

For more details visit this link