
I have a script that parses a list with thousands of URLs, but my problem is that it would take ages to get through the whole list.

Each request takes around 4 seconds before the page is loaded and can be parsed. Is there any way to parse a really large number of URLs quickly?

My code looks like this:

from bs4 import BeautifulSoup
import requests

# read the URL list
with open('urls.txt') as f:
    content = f.readlines()
# remove whitespace characters
content = [line.strip('\n') for line in content]

# loop through the URL list and get information
for i in range(5):
    try:
        for url in content:
            # fetch the page
            link = requests.get(url)
            data = link.text
            soup = BeautifulSoup(data, "html5lib")

            # just example scraping
            name = soup.find_all('h1', {'class': 'name'})
    except requests.exceptions.RequestException:
        # a failed request aborts the current pass; move on
        continue

EDIT: How do I handle asynchronous requests with hooks in this example? I tried the following, as mentioned in this thread: Asynchronous Requests with Python requests:

from bs4 import BeautifulSoup   
import grequests

def parser(response):
    for url in urls:

        #get information
        link = requests.get(response)
        data = link.text
        soup = BeautifulSoup(data, "html5lib")

        #just example scraping
        name = soup.find_all('h1', {'class': 'name'})

#read urls.txt and store in list variable
with open('urls.txt') as f:
    urls = f.readlines()
# you may also want to remove whitespace characters 
urls = [line.strip('\n') for line in urls]

# A list to hold our things to do via async
async_list = []

for u in urls:
    # The "hooks = {..." part is where you define what you want to do
    # 
    # Note the lack of parentheses following do_something, this is
    # because the response will be used as the first argument automatically
    rs = grequests.get(u, hooks = {'response' : parser})

    # Add the task to our list of things to do via async
    async_list.append(rs)

# Do our list of things to do via async
grequests.map(async_list, size=5)

This doesn't work for me. I don't even get an error in the console; it just keeps running for a long time until it stops.
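For comparison, here is a minimal sketch of how the hook is apparently meant to be wired (assuming grequests and gevent are installed; not tested): the callback receives the already-fetched response, so it shouldn't loop over the URL list or call requests.get again.

from bs4 import BeautifulSoup
import grequests

def parser(response, *args, **kwargs):
    # the response is already fetched here - just parse it
    soup = BeautifulSoup(response.text, "html5lib")

    # just example scraping
    name = soup.find_all('h1', {'class': 'name'})

# read urls.txt and store in a list
with open('urls.txt') as f:
    urls = [line.strip('\n') for line in f]

# build the (not yet sent) requests with the response hook attached
async_list = [grequests.get(u, hooks={'response': parser}) for u in urls]

# send them, at most 5 at a time
grequests.map(async_list, size=5)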

kratze
    The documentation is your friend: http://docs.python-requests.org/en/v0.10.6/user/advanced/#asynchronous-requests – Tomalak Sep 08 '17 at 13:19
  • I would suggest breaking your URL list up and putting time gaps between requests, exactly what @Tomalak suggests – chad Sep 08 '17 at 13:20
  • @Tomalak you should make that an answer because it solves the user's immediate problem. – Horia Coman Sep 08 '17 at 13:22
  • For faster parsing, use `"lxml"` instead of `"html5lib"`. [See here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) – MD. Khairul Basar Sep 08 '17 at 13:23
  • Thank you very much, could you explain how exactly this works? Right now I am iterating through my URL list and making a request for each URL; when the request is made I parse the page and store the result in a database, then the loop makes the next request. Can this method be combined with that? – kratze Sep 08 '17 at 13:28
  • The documentation describes how to use event hooks. Just put your content-handling code into a function and hook that to the `response` event. This works exactly the same way for `requests.get()` and `async.get()`. – Tomalak Sep 08 '17 at 13:36
  • I edited the code, is that the way to use it? Sorry, I find the documentation a bit poor for beginners. But I also noticed that this isn't available in Python 3, and I wrote my whole script in Python 3. – kratze Sep 08 '17 at 14:51
  • Not quite, see here for working code: https://stackoverflow.com/questions/9110593/asynchronous-requests-with-python-requests. As per the comments to the second answer in the same thread, you can use [async io for HTTP](https://pypi.python.org/pypi/aiohttp) on Python 3. This blog post seems to address that in detail: https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html – Tomalak Sep 08 '17 at 14:56
  • This looks promising and understandable, thank you! – kratze Sep 08 '17 at 14:59
  • Sorry for not writing that as an answer, but it would take me more time than I have today to write a useful one. I'm confident you can transform the code in the blog post to a solution - once you've figured it out, please post your own answer. I'll stick around and upvote. – Tomalak Sep 08 '17 at 15:02

1 Answer


If someone is curious about this question: I decided to start my project again from scratch and use Scrapy instead of BeautifulSoup.

Scrapy is a full framework for web scraping. It has built-in features for handling thousands of requests at once, and you can throttle your requests so that you scrape the target site in a "friendlier" way.

I hope this might help someone. For me it was the better choice for this project.
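As an illustration, a minimal spider along those lines could look roughly like this (the h1.name selector and the settings values are just placeholders carried over from the example above; concurrency and throttling are controlled through Scrapy's built-in settings):

import scrapy

class NameSpider(scrapy.Spider):
    name = "names"

    # built-in concurrency / throttling knobs (example values only)
    custom_settings = {
        'CONCURRENT_REQUESTS': 16,
        'DOWNLOAD_DELAY': 0.25,
        'AUTOTHROTTLE_ENABLED': True,
    }

    def start_requests(self):
        # read the same urls.txt as before
        with open('urls.txt') as f:
            urls = [line.strip() for line in f if line.strip()]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # just example scraping, same as before
        for name in response.css('h1.name::text').getall():
            yield {'name': name}

Running it with scrapy runspider name_spider.py -o names.json lets Scrapy schedule the requests concurrently, and AutoThrottle adjusts the delay based on how fast the target site responds.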

kratze