I am using Jsoup to crawl a site, but the site redirects to a new page using JavaScript. I am sure it is not an HTTP 302 redirect, because the redirect stops happening when I turn off JavaScript in my browser. Is there a way to make Jsoup follow JavaScript redirects automatically? If not, what alternatives can follow them?

angelokh

1 Answer

Jsoup is an HTML parser. It doesn't include a JavaScript engine, so it cannot execute JavaScript. To execute JavaScript you will have to drive a real or headless browser, for example with Selenium WebDriver.

Another alternative is to parse the JavaScript (as text) that is responsible for the redirect and extract the URL from it. After that, you scrape the site as you normally would. But this is a "hack": it's not automatic, and I don't know if it's generic enough for your needs.
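As a rough illustration of the text-parsing approach, here is a minimal sketch that only uses `java.util.regex`. The class name `JsRedirectExtractor` and the method `findRedirectTarget` are hypothetical, and the patterns assume the redirect target appears as a plain string literal in a `location` assignment or a `location.replace(...)` / `location.assign(...)` call. It will not work on minified or obfuscated scripts that build the URL at runtime.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: looks for a literal redirect target in page source.
final class JsRedirectExtractor {

    // Matches e.g. window.location = "/next" or location.href = '/next'
    private static final Pattern ASSIGNMENT = Pattern.compile(
            "\\b(?:window\\.|document\\.)?location(?:\\.href)?\\s*=\\s*[\"']([^\"']+)[\"']");

    // Matches e.g. location.replace("/next") or window.location.assign('/next')
    private static final Pattern CALL = Pattern.compile(
            "\\b(?:window\\.|document\\.)?location\\.(?:replace|assign)\\s*\\(\\s*[\"']([^\"']+)[\"']\\s*\\)");

    private JsRedirectExtractor() {
    }

    // Returns the first redirect target found in the given HTML or script text,
    // or Optional.empty() if no literal target is present.
    static Optional<String> findRedirectTarget(String html) {
        for (Pattern pattern : new Pattern[] {ASSIGNMENT, CALL}) {
            Matcher matcher = pattern.matcher(html);
            if (matcher.find()) {
                return Optional.of(matcher.group(1));
            }
        }
        return Optional.empty();
    }
}
```

In practice the input would be the page source you already fetched with Jsoup (for example the string returned by `Document.html()`). The extracted value may be a relative URL, so it should be resolved against the page URL before you request it with `Jsoup.connect(...)`.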

Alkis Kalogeris
  • The site is protected by anti-scraping services like ShieldSquare and Distil Networks. Will WebDriver still work? The JavaScript on the page has been uglified, so it is not possible to get the URL from it. I think those protection services also create a fingerprint of the client. Do you have any experience with this? – angelokh Jul 29 '15 at 07:46
  • Unfortunately no. But I assume even these services depend on the headers sent by the client. Selenium drives a regular browser, so if you set the headers correctly (User-Agent etc.) I don't believe there will be a problem. The server shouldn't be able to tell the difference between a headless browser and a regular browser. But as I've said, I have no experience with these services, so take what I'm saying with a grain of salt. To check the headers sent by your browser, see http://stackoverflow.com/questions/31549799/using-jsoup-to-login-to-coned-website/31570494#31570494 – Alkis Kalogeris Jul 29 '15 at 08:08