2

I am trying to scrape this website: https://www.auto24.ee. Until recently I could scrape data from it without any problems, but as of today every request returns "Response 403". I tried using proxies and sending more information in the headers, but nothing works, and I could not find a solution online. This is the code that used to work:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36',
}

page = requests.get("https://www.auto24.ee/", headers=headers)

print(page)
Icecreamz

2 Answers

5

The code here

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'}
page = requests.get("https://www.auto24.ee/", headers=headers)
print(page.text)

It always returns something like the following:

 <div class="cf-section cf-wrapper">
        <div class="cf-columns two">
          <div class="cf-column">
            <h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2>

            <p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p>
          </div>

          <div class="cf-column">
            <h2 data-translate="resolve_captcha_headline">What can I do to prevent this in the future?</h2>


            <p data-translate="resolve_captcha_antivirus">If you are on a personal connection, like at home, you can 
run an anti-virus scan on your device to make sure it is not infected with malware.</p>

The website is protected by Cloudflare. With standard automation tools such as requests or selenium, there is very little chance of getting through. You are seeing a 403 because your client is detected as a robot. There may be ways to bypass Cloudflare described elsewhere, but the website is working as intended. A real browser sends a lot of data in headers and cookies that shows the request is valid; since you are submitting only a user agent, Cloudflare is triggered. Spoofing another User-Agent is nowhere near enough to avoid the CAPTCHA, because Cloudflare checks many things.
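To confirm that a 403 really comes from Cloudflare rather than from the site itself, the response can be inspected. The following is only a minimal sketch: the function name is hypothetical, and the checks are heuristics based on the `cf-wrapper` markup shown above plus the `Server` and `CF-RAY` response headers that Cloudflare normally sets.

```python
def looks_like_cloudflare_challenge(status_code, headers, body):
    """Heuristically decide whether a response is a Cloudflare block page."""
    # Header names are case-insensitive, so normalise them before checking.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    served_by_cloudflare = (
        "cloudflare" in lowered.get("server", "") or "cf-ray" in lowered
    )
    # The block page quoted above uses these class names and attributes.
    has_challenge_markup = "cf-wrapper" in body or "why_captcha_headline" in body
    return status_code == 403 and served_by_cloudflare and has_challenge_markup


# Made-up sample values standing in for a real response.
sample_headers = {"Server": "cloudflare", "CF-RAY": "hypothetical-ray-id"}
sample_body = '<div class="cf-section cf-wrapper">...</div>'
print(looks_like_cloudflare_challenge(403, sample_headers, sample_body))
```

With the code from the question, the call would be `looks_like_cloudflare_challenge(page.status_code, page.headers, page.text)`.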

I suggest you look at selenium here since it simulates a real browser, or research guides to (possibly?) bypass Cloudflare with requests.
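A rough sketch of the selenium route, assuming Selenium 4 and a local Chrome installation. The function name is hypothetical, and there is no guarantee Cloudflare will let an automated browser through; it may still show the challenge page.

```python
try:
    from selenium import webdriver
except ImportError:  # Selenium is an optional third-party dependency
    webdriver = None


def fetch_with_browser(url):
    """Load url in a real Chrome instance and return the rendered HTML."""
    if webdriver is None:
        raise RuntimeError("Install Selenium first: pip install selenium")
    driver = webdriver.Chrome()  # assumption: Chrome is installed locally
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the load fails
```

Usage would be `html = fetch_with_browser("https://www.auto24.ee/")`; checking the returned HTML for the CAPTCHA text shows whether the browser was challenged as well.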

Update: I found two Python libraries, cloudscraper and cfscrape. Neither is usable for this site, since it uses Cloudflare v2, unless you pay for a premium version.

KMM
  • Thanks for your response, I did not realize it myself. At least now I know the cause. Unfortunately it's not easy to develop a captcha solver for this one. – Icecreamz Dec 15 '21 at 21:31
  • 1
    Cloudflare exists for a reason, sadly! I'm sure there are extremely difficult ways to get past it. I was looking at the cookies and saw some that were tied to the current time and date; those could possibly be manipulated to bypass it. Other than that, this is beyond me. If you get the chance, accept my answer so others will be able to solve this too. Have a nice day! – KMM Dec 15 '21 at 21:38
0

You need to find your browser's User-Agent. Open the browser's developer tools (press Ctrl+Shift+I) and look at the User-Agent request header of the GET request in the Network tab.

Here is how you can find the User-Agent for different browsers.

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'}
page = requests.get("https://www.auto24.ee/", headers=headers)
print(page)

Also, try to use requests.Session()

import requests
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'}
page = session.get("https://www.auto24.ee/", headers=headers)
print(page.content.decode())

A session may solve the problem; otherwise, install cfscrape. See the answer in "Python - Request being blocked by Cloudflare".

Update

Try pip install cloudscraper -U. For more information, see "A Python module to bypass Cloudflare's anti-bot page". In its issues, you will find a fix for Cloudflare v2.
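A minimal sketch of how cloudscraper is typically used, assuming the package is installed. The function name is hypothetical. Per the comments on this question, the free version reportedly does not handle Cloudflare v2, so on this site the call may still return 403 or raise an exception.

```python
try:
    import cloudscraper
except ImportError:  # cloudscraper is an optional third-party dependency
    cloudscraper = None


def fetch_with_cloudscraper(url):
    """Request url through cloudscraper and return (status code, body text)."""
    if cloudscraper is None:
        raise RuntimeError("Install it first: pip install cloudscraper -U")
    # create_scraper() returns an object that is used like requests.Session.
    scraper = cloudscraper.create_scraper()
    response = scraper.get(url)
    return response.status_code, response.text
```

Calling `fetch_with_cloudscraper("https://www.auto24.ee/")` and checking whether the status is 200 or 403 shows whether the challenge was passed.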

I_Al-thamary
  • 1
    As I said, passing the user-agent does not help. – Icecreamz Dec 15 '21 at 21:04
  • I have updated it. Try it now. – I_Al-thamary Dec 15 '21 at 21:17
  • 1
    As @Keegan M said, it is caused by the CloudFlare protection, therefore not an easy task to solve. – Icecreamz Dec 15 '21 at 21:33
  • The session may solve the problem or install `cfscrape` [Python Request being blocked by Cloudflare](https://stackoverflow.com/questions/49087990/python-request-being-blocked-by-cloudflare) – I_Al-thamary Dec 15 '21 at 21:37
  • 2
    A session alone makes no difference. Neither cfscrape nor cloudscraper is usable, since this is Cloudflare v2. – KMM Dec 15 '21 at 21:53
  • Try `pip install cloudscraper -U` for more info see [A Python module to bypass Cloudflare's anti-bot page](https://pythonrepo.com/repo/VeNoMouS-cloudscraper-python-web-crawling). In the issues, you will find a fix for v2 – I_Al-thamary Dec 15 '21 at 22:16