I would like to download images in bulk using Google image search.
My first method, downloading the page source to a file and then opening it with open(), works fine, but I would like to be able to fetch the image URLs just by running the script and changing the keywords.
First method: Go to the image search (https://www.google.no/search?q=tower&client=opera&hs=UNl&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiM5fnf4_zKAhWIJJoKHYUdBg4Q_AUIBygB&biw=1920&bih=982). View the page source in the browser and save it to an HTML file. When I then open() that HTML file with the script, it works as expected and I get a neat list of all the URLs of the images on the search page. This is what line 6 of the script does (uncomment to test).
If, however, I use the requests.get() function to fetch the webpage, as shown in line 7 of the script, it retrieves a different HTML document, one that does not contain the full URLs of the images, so I cannot extract them.
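One possible explanation (an assumption on my part, not something I have confirmed) is that Google serves simplified markup to clients that do not look like a browser. A common workaround is to send a browser-like User-Agent header with requests.get(); a minimal sketch, where the header string is just an arbitrary example value:

```python
import requests

# Assumption: Google may return HTML closer to what the browser sees when
# the request carries a browser-like User-Agent. The exact string below is
# an arbitrary example, not a requirement.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/47.0.2526.111 Safari/537.36'),
}

def fetch_search_page(url):
    """Fetch a search results page while pretending to be a desktop browser."""
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text
```

Even with this, the returned markup is not guaranteed to match what the browser saves, so the extraction step may still need adjusting.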
Please help me extract the correct URLs of the images.
Edit: link to the tower.html I am using: https://www.dropbox.com/s/yy39w1oc8sjkp3u/tower.html?dl=0
This is the code I have written so far:
import requests
from bs4 import BeautifulSoup
# define the url to be scraped
url = 'https://www.google.no/search?q=tower&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'
# The top line uses the attached "tower.html" as the source; the bottom line uses the URL. The HTML file contains the source of the URL above.
#page = open('tower.html', 'r').read()
page = requests.get(url).text
# parse the text as html
soup = BeautifulSoup(page, 'html.parser')
# iterate over all "a" elements
for raw_link in soup.find_all('a'):
    link = raw_link.get('href')
    # keep only string hrefs that contain "imgurl" (the page has other
    # links that are not interesting)
    if isinstance(link, str) and 'imgurl' in link:
        # print the part of the link between "=" and "&", which is the
        # actual URL of the image
        print(link.split('=')[1].split('&')[0])
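Assuming the links in tower.html carry the image address in an imgurl query parameter (which is what the split on "=" and "&" above implies), the standard library's urllib.parse extracts and percent-decodes that parameter more robustly than string splitting; a sketch with a made-up href in that shape:

```python
from urllib.parse import urlparse, parse_qs

def extract_imgurl(href):
    """Return the decoded image URL from a result link, or None.

    Expects hrefs shaped like '/imgres?imgurl=http%3A%2F%2F...&imgrefurl=...'.
    parse_qs percent-decodes the value, so this returns the real URL rather
    than the %3A%2F%2F-encoded form the plain string split produces.
    """
    params = parse_qs(urlparse(href).query)
    values = params.get('imgurl')
    return values[0] if values else None

# made-up example href in the assumed shape
sample = '/imgres?imgurl=http%3A%2F%2Fexample.com%2Ftower.jpg&imgrefurl=http%3A%2F%2Fexample.com'
print(extract_imgurl(sample))  # http://example.com/tower.jpg
```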