
I am trying to run the following script, but seloger appears to be blocking it.

https://github.com/edouardmulliez/scraper-seloger/blob/master/seloger_scraper.py

I am trying to add headers to get around the issue, as described here:

How to use Python requests to fake a browser visit?

However, I have not managed to get it working. Here is the relevant part of my code, including the while loop:

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re
import os
import matplotlib.pyplot as plt
import seaborn as sns

locality = []
price = []
room_nb = []
bedroom_nb = []
surface = []
acc_type = []


# setup that precedes the loop (not shown above):
s = requests.Session()                    # session reused for every page
headers = {'User-Agent': 'Mozilla/5.0'}   # the header I am trying to add, per the linked answer
page = 1
max_pages = 100                           # last results page to visit

while page <= max_pages:
    url = ("http://www.seloger.com/list.htm?org=advanced_search&idtt=2&idtypebien=2,1&cp=75&tri=initial&LISTING-LISTpg=" +
           str(page) + "&naturebien=1,2,4")
    page += 1
    try:
        r = s.get(url, headers=headers)
    except:
        # Stop visiting pages
        break

1 Answer

  • The issue may be caused by a missing or fake user agent, or by the site's scraping protection. Try adding a User-Agent header to your requests, as in the snippet below; a version applied to your session is shown after this list.

    import requests

    headers = {'User-Agent': 'Mozilla/5.0'}
    url = "https://example.com"  # replace with the page you want to fetch
    r = requests.get(url, headers=headers)
    
  • If that is not the issue, I suggest pyppeteer, which launches a real Chromium browser and fetches the pages through it; most websites cannot block such a session. A minimal sketch is shown after this list.

  • If that does not work either, look for APIs or other data endpoints among the XMLHttpRequests in the Network tab of your browser's developer tools. Calling the API directly gets you the data much faster than parsing HTML; see the sketch after this list.
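
Applied to your loop, which already uses a requests.Session, the headers only need to be set once on the session. A minimal sketch, using your listing URL (the full browser User-Agent string below is only an example):

    import requests

    s = requests.Session()
    # send a realistic browser User-Agent with every request made through this session
    s.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                    'AppleWebKit/537.36 (KHTML, like Gecko) '
                                    'Chrome/96.0.4664.110 Safari/537.36'})
    url = ("http://www.seloger.com/list.htm?org=advanced_search&idtt=2&idtypebien=2,1"
           "&cp=75&tri=initial&LISTING-LISTpg=1&naturebien=1,2,4")
    r = s.get(url)
    print(r.status_code)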
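
If you go the pyppeteer route, a rough sketch could look like this (assuming pyppeteer and its bundled Chromium are installed; not tested against seloger):

    import asyncio
    from pyppeteer import launch

    async def fetch(url):
        browser = await launch()        # starts a real (headless) Chromium
        page = await browser.newPage()
        await page.goto(url)            # the browser handles cookies and JavaScript
        html = await page.content()     # fully rendered HTML
        await browser.close()
        return html

    url = ("http://www.seloger.com/list.htm?org=advanced_search&idtt=2&idtypebien=2,1"
           "&cp=75&tri=initial&LISTING-LISTpg=1&naturebien=1,2,4")
    html = asyncio.get_event_loop().run_until_complete(fetch(url))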
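
For the API approach, the idea is to call the JSON endpoint directly instead of parsing HTML. The endpoint and keys below are purely illustrative; copy the real request URL from the XHR entries in your own Network tab:

    import requests

    # hypothetical JSON endpoint, replace with the one you find in the Network tab
    api_url = "https://www.example.com/search/api?cp=75&page=1"
    r = requests.get(api_url, headers={'User-Agent': 'Mozilla/5.0'})
    data = r.json()                       # JSON is much easier to parse than scraped HTML
    for item in data.get("items", []):    # "items" is a guessed key, adjust to the real payload
        print(item)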

cangokceaslan