
I read through all the scraper-detection threads and put together a list of Selenium options that are supposedly necessary.

But...

A few links still cause problems, no matter which action I take.

One example of a URL I cannot scrape is: www.mobilityhouse.com/de_de/zubehoer/ladekabel.html

Here is my scraper.

I would love to know what is missing. Since I want to save resources (I plan to add threading later on), I am looking for a headless solution.

Thank you!

Code:

########################
# scraper
########################

import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def seleniumhtml_url(link):
    dic = {}
    dirname = os.path.dirname(__file__)
    filepath = os.path.join(dirname, 'chromedriver')

    chrome_options = Options()
    chrome_options.add_argument('--incognito')
    chrome_options.add_argument('--enable-javascript')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-gpu')
    # Spoof a regular desktop user agent so the headless default is not sent.
    chrome_options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                'AppleWebKit/537.36 (KHTML, like Gecko) '
                                'Chrome/79.0.3945.79 Safari/537.36')
    chrome_options.add_argument('--disable-extensions')
    chrome_options.add_argument('start-maximized')
    chrome_options.add_argument('disable-infobars')
    # Hide the "controlled by automated software" hints.
    chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
    chrome_options.add_experimental_option('useAutomationExtension', False)

    # executable_path is optional; if omitted, chromedriver is looked up on PATH.
    driver = webdriver.Chrome(executable_path=filepath, options=chrome_options)
    driver.get(link)
    # time.sleep(3)  # Give the page a moment to render if needed.

    html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    driver.quit()

    dic['html'] = html

    return dic
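
For reference, this is roughly how I call it; the https:// scheme here is my assumption, since driver.get() needs a fully qualified URL:

# Minimal reproduction sketch (scheme assumed, not part of the original URL above).
result = seleniumhtml_url('https://www.mobilityhouse.com/de_de/zubehoer/ladekabel.html')
print(len(result['html']))  # A very short document usually means a block or consent page.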