There are lots of Selenium scraping questions, but none that fits my problem. I want to download ~2000 images located inside a scrollable element. The images are only loaded once they have been scrolled past. Inspecting the HTML, the elements are not there at all before scrolling, which seems different from lazy loading. In any case, this solution does not work. What does work is to scroll incrementally through the scroll element and download all available images after each small scroll.

import os
import numpy as np
from selenium import webdriver  # the package name is lowercase
from os.path import join
from bs4 import BeautifulSoup
import time

wd = webdriver.Chrome('chromedriver')

# Find height of the scroll element
js_get_scroll_height = f'return document.querySelector(<ugly-hack-to-select-correct-element>).scrollHeight'
scroll_height = wd.execute_script(js_get_scroll_height)
heights_to_scroll = np.arange(0,scroll_height,1200)

# Incrementally scroll and download
for next_height in heights_to_scroll:
    js_scroll = f'document.querySelector(<ugly-hack-to-select-correct-element>).scrollTo(0,{next_height})'
    wd.execute_script(js_scroll)
    time.sleep(2)
    soup = BeautifulSoup(wd.page_source, 'html.parser')
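    # Only keep <img> tags that actually carry a src attribute
    new_img_urls = [item['src'] for item in soup.find_all('img', src=True)]
    for img_url in new_img_urls:
        # model_name and saved_image are defined elsewhere (not shown)
        fpath = join(model_name, img_url + '.jpg')
        if os.path.isfile(fpath):
            print(f'skipping {fpath} because it already exists')
        else:
            saved_image(fpath, img_url)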
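
For reference, the bookkeeping inside the loop can be sketched without a browser. This is only a minimal sketch: the helper names (scroll_offsets, collect_new, local_path_for) are hypothetical and not part of my actual script, and hashing the URL to build the file name is an assumption, made here so that slashes or query strings in the URL cannot end up in the path.

```python
import hashlib
import os
from os.path import join


def scroll_offsets(scroll_height, step=1200):
    """Offsets to scroll to: 0, step, 2*step, ... below scroll_height."""
    return list(range(0, scroll_height, step))


def collect_new(urls, seen):
    """Return the URLs not seen before (in order) and record them in `seen`."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


def local_path_for(img_url, out_dir):
    """Map an image URL to a flat file name inside out_dir (hypothetical)."""
    # Hash the whole URL so the file name contains only hex characters.
    digest = hashlib.sha1(img_url.encode("utf-8")).hexdigest()[:16]
    return join(out_dir, digest + ".jpg")


if __name__ == "__main__":
    seen = set()
    for offset in scroll_offsets(3000):
        # In the real loop these URLs would come from the parsed page source.
        page_urls = ["https://example.com/img/%d.png" % offset]
        for url in collect_new(page_urls, seen):
            print(offset, local_path_for(url, "out"))
```

Keeping a set of already-seen URLs means each scroll step only handles images that appeared since the previous step, instead of re-checking every image on disk.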

The problem is that the images only load, i.e. appear in the HTML, if the Chrome window opened by Selenium is in the foreground on my computer. I need to actually be looking at the page while the scrolling is happening. If I am doing something else on my computer, or if it goes to sleep (which happens, as the whole process takes ~10 min), then new_img_urls just ends up as an empty list.

This will not work for me because I want to be able to repeat the process on a few dozen different webpages (with the same format). How can I achieve the same thing with Chrome in the background?

ludog