I read through all the scraper detection threads and put together a list of Selenium options that seem necessary.
But...
A few links still cause problems, regardless of what I try.
One URL I cannot scrape, for example, is: www.mobilityhouse.com/de_de/zubehoer/ladekabel.html
Here is my scraper.
I'd love to know what is missing. And since I want to save resources (because of threading later on), I am looking for a headless solution.
Thank you!
Code:
########################
# scraper
########################
import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


def seleniumhtml_url(link):
    dic = {}
    dirname = os.path.dirname(__file__)
    filepath = os.path.join(dirname, 'chromedriver')

    chrome_options = Options()
    chrome_options.add_argument('--incognito')
    chrome_options.add_argument('--enable-javascript')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36')
    chrome_options.add_argument('--disable-extensions')
    chrome_options.add_argument('--start-maximized')
    chrome_options.add_argument('--disable-infobars')
    chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
    chrome_options.add_experimental_option('useAutomationExtension', False)

    # Selenium 4: pass the driver path via Service; the old
    # executable_path / chrome_options keywords are deprecated.
    driver = webdriver.Chrome(service=Service(filepath), options=chrome_options)
    driver.get(link)
    # time.sleep(3)  # give dynamic content time to load, if needed
    html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    driver.quit()
    dic["html"] = html
    return dic
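One thing worth checking: headless Chrome exposes `navigator.webdriver === true` by default, which many anti-bot scripts test for. Whether that is what blocks this particular site is an assumption on my part, but a common workaround is to inject an override through the Chrome DevTools Protocol (Selenium's real `execute_cdp_cmd` method) before navigating. A minimal sketch, with the helper name `apply_stealth` being my own invention:

```python
# Sketch: mask navigator.webdriver before any page script runs.
# Assumption: the site checks navigator.webdriver; other fingerprint
# signals (headless user agent, missing plugins) may also play a role.

HIDE_WEBDRIVER_JS = """
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
"""

def apply_stealth(driver):
    """Register the override so it runs in every new document,
    using the CDP command Page.addScriptToEvaluateOnNewDocument."""
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": HIDE_WEBDRIVER_JS},
    )
```

It would be called right after creating the driver and before `driver.get(link)`, so the override is in place when the first page loads.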