11

I'm trying to do some scraping, but I get blocked every 4 requests. I have tried changing proxies, but the error stays the same. What should I do to rotate them properly?

Here is some code where I try it. First I get proxies from a free website. Then I make the request with the new proxy, but it doesn't work because I still get blocked.

from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import requests

def get_player(id,proxy):
    ua=UserAgent()
    headers = {'User-Agent':ua.random}

    url='https://www.transfermarkt.es/jadon-sancho/profil/spieler/'+str(id)

    try:
        print(proxy)
        r=requests.get(url,headers=headers,proxies=proxy)
    except requests.RequestException:

....
code to manage the data
....

Getting proxies

def get_proxies():
    ua=UserAgent()
    headers = {'User-Agent':ua.random}
    url='https://free-proxy-list.net/'

    r=requests.get(url,headers=headers)
    page = BeautifulSoup(r.text, 'html.parser')

    proxies=[]

    for proxy in page.find_all('tr'):
        i=ip=port=0

        for data in proxy.find_all('td'):
            if i==0:
                ip=data.get_text()
            if i==1:
                port=data.get_text()
            i+=1

        if ip!=0 and port!=0:
            proxies+=[{'http':'http://'+ip+':'+port}]

    return proxies

Calling functions

proxies=get_proxies()
for i in range(1,100):
    player=get_player(i,proxies[i//4])

....
code to manage the data  
....

I know the proxy scraping works, because when I print them I see something like: {'http': 'http://88.12.48.61:42365'}. I just want to stop getting blocked.

  • 1
    I had this problem in the past. Do you know if those proxies are HTTP or HTTPS proxies and whether the server only accepts from a specific type? For me I had the same issue until I learned the server only accepts HTTP proxies but I was feeding it HTTPS proxies. Now my script just runs 24/7 – Edeki Okoh Apr 26 '19 at 17:30
  • It could be. I have just tried with HTTPS and it is even worse, because I can't access the site at all. With HTTP I get a maximum of 6 requests, but with HTTPS none. – Javier Jiménez de la Jara Apr 26 '19 at 23:23
  • _quick question_ : What are you trying to scrape that you're getting blocked? – P.hunter Apr 27 '19 at 10:43
  • It's 'transfermarkt', a football site. In the end I tried HTTPS proxies from 'https://hidemyna.me/es/proxy-list/?type=s#list' and it worked. Do you know another free page to get a list? – Javier Jiménez de la Jara Apr 27 '19 at 11:09
  • @JavierJiménezdelaJara does [using a VPN](https://github.com/thispc/psiphon) help? Have you tried scrapy? It might work – P.hunter Apr 27 '19 at 16:27
  • 1
    I used proxybroker (a github package) to get proxies and worked perfectly – Javier Jiménez de la Jara Apr 27 '19 at 16:48
  • Great, but I'm still wondering why the website was blocking your requests after 5 requests? – P.hunter Apr 27 '19 at 16:57
  • Hey @JavierJiménezdelaJara, I'm pretty sure I'm now doing what you tried here, but for 'ogol', another football site. Do you have any contact to share, like your Discord or Telegram, so I could get some tips from you and your code? Thanks! – Emerson Oliveira Mar 16 '21 at 17:13

3 Answers

16

I recently had this same issue, but using online proxy servers as recommended in other answers is always risky (from a privacy standpoint), slow, or unreliable.

Instead, you can use the requests-ip-rotator python library to proxy traffic through AWS API Gateway, which gives you a new IP each time:
pip install requests-ip-rotator

This can be used as follows (for your site specifically):

import requests
from requests_ip_rotator import ApiGateway, EXTRA_REGIONS

gateway = ApiGateway("https://www.transfermarkt.es")
gateway.start()

session = requests.Session()
session.mount("https://www.transfermarkt.es", gateway)

response = session.get("https://www.transfermarkt.es/jadon-sancho/profil/spieler/your_id")
print(response.status_code)

# Only run this line if you are no longer going to run the script, as it takes longer to boot up again next time.
gateway.shutdown() 

Combined with multithreading/multiprocessing, you'll be able to scrape the site in no time.

The AWS free tier provides you with 1 million requests per region, so this option will be free for all reasonable scraping.
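As a sketch of the multithreading idea mentioned above: a shared session plus a thread pool is usually enough. The function names, worker count, and ID range below are illustrative (not part of the library), and the plain `requests.Session()` is a stand-in for the gateway-mounted session from the snippet above.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

BASE = "https://www.transfermarkt.es/jadon-sancho/profil/spieler/"

def fetch(session, player_id):
    # Workers share one session; with the gateway mounted, each request
    # leaves through a fresh API Gateway IP.
    r = session.get(BASE + str(player_id))
    return player_id, r.status_code

def fetch_all(session, ids, workers=8):
    # Fan the requests out over a small thread pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: fetch(session, i), ids))

if __name__ == "__main__":
    session = requests.Session()  # swap in the gateway-mounted session
    for player_id, status in fetch_all(session, range(1, 100)):
        print(player_id, status)
```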

George
  • 3
    Amazing tool, thanks for putting it together! – Carlos Souza Sep 20 '21 at 16:28
  • 4
    Thanks! Also, I would like to add, that you need to get your API keys from AWS and add them in this way: `gateway = ApiGateway(site="site.com", access_key_id = AWS_ACCESS_KEY_ID, access_key_secret = AWS_SECRET_ACCESS_KEY)` You can follow [this guide](https://docs.aws.amazon.com/powershell/latest/userguide/pstools-appendix-sign-up.html). how to retrieve your keys. – vekeras Nov 05 '21 at 13:07
  • 1
    Indeed - or optionally if the keys are stored in environment variables then they will be automatically used too, as detailed in [this aws guide](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html) : ) – George Nov 05 '21 at 14:50
8
import requests
from itertools import cycle

list_proxy = ['socks5://Username:Password@IP1:20000',
              'socks5://Username:Password@IP2:20000',
              'socks5://Username:Password@IP3:20000',
              'socks5://Username:Password@IP4:20000',
              ]

proxy_cycle = cycle(list_proxy)
# Prime the pump
proxy = next(proxy_cycle)

for i in range(1, 10):
    proxy = next(proxy_cycle)
    print(proxy)
    proxies = {
        "http": proxy,
        "https": proxy,
    }
    r = requests.get(url='https://ident.me/', proxies=proxies)
    print(r.text)
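Free or cheap proxies die constantly, so a rotation like the one above usually needs a fallback path. A minimal sketch (the helper name and timeout value are my own, not from the answer) that drops a proxy from the pool once it errors:

```python
import requests

def get_with_rotation(url, proxy_list, timeout=5):
    # Walk the pool; any proxy that errors is removed so it is not retried.
    pool = list(proxy_list)
    while pool:
        proxy = pool[0]
        try:
            return requests.get(url,
                                proxies={"http": proxy, "https": proxy},
                                timeout=timeout)
        except requests.RequestException:
            pool.pop(0)  # dead or blocked proxy: discard and try the next one
    raise RuntimeError("all proxies failed")
```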
John Wick
5

The problem with using free proxies from sites like this is that:

  1. websites know about these and may block just because you're using one of them

  2. you don't know that other people haven't gotten them blacklisted by doing bad things with them

  3. the site is likely using some form of other identifier to track you across proxies based on other characteristics (device fingerprinting, proxy-piercing, etc)

Unfortunately, there's not a lot you can do other than become more sophisticated (distribute across multiple devices, use a VPN/Tor, etc.) and risk your IP being blocked for DDOS-like traffic, or, preferably, check whether the site offers an API for access.
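If no official API is available, the least risky fallback is simply slowing down. A sketch (the helper name and delay values are arbitrary examples, not a known-good threshold for any site) of spacing requests with a randomized pause:

```python
import random
import time

def polite_get(session, url, min_delay=3.0, jitter=2.0):
    # Sleep a base delay plus random jitter so requests do not arrive
    # in a machine-regular rhythm; tune the values to the target site.
    time.sleep(min_delay + random.uniform(0, jitter))
    return session.get(url)
```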

Nazim Kerimbekov
G. Anderson