4

I have a serious problem here. I have searched all through Stack Overflow and many other sites. Everywhere they give the same solution, and I have tried all of them, but I am not able to resolve this issue.

I have the following code:

Document doc = Jsoup.connect(url).timeout(30000).get();

Here I am using the Jsoup library, and the result I am getting is not the same as the actual page source that we can see via right click on the page -> page source. Many parts are missing from the result I get with the above line of code. After searching some sites on Google, I saw this method:

URL url = new URL(webPage);
URLConnection urlConnection = url.openConnection();
urlConnection.setConnectTimeout(10000);
urlConnection.setReadTimeout(10000);
InputStream is = urlConnection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);

int numCharsRead;
char[] charArray = new char[1024];
StringBuffer sb = new StringBuffer();
while ((numCharsRead = isr.read(charArray)) > 0) {
    sb.append(charArray, 0, numCharsRead);
}
String result = sb.toString();

System.out.println(result);

But no luck. While searching the internet for this problem, I saw many sites saying I had to set the proper charset and encoding of the web page while downloading its page source. But how will I get to know these things from my code dynamically? Are there any classes in Java for that? I also looked at crawler4j a bit, but it did not do much for me. Please help, guys. I have been stuck with this problem for over a month now and have tried everything I can, so my final hope is the gods of Stack Overflow, who have always helped!

Vasanth Nag K V
  • Maybe the page source is malformed HTML and JSoup is cleaning it up by removing the invalid sections? – dnault Nov 13 '13 at 18:53
  • Hi dnault, the webpage is from a well hosted website, so I think it would have gone through validation and is probably not malformed HTML. – Vasanth Nag K V Nov 13 '13 at 18:57

3 Answers

5

I had this recently. I'd run into some sort of robot protection. Change your original line to:

Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0")
                    .timeout(30000)
                    .get();
Andrii Abramov
cftygv
  • Cool, nice one cftygv! I went through a lot of trouble learning Selenium, Maven and all. I didn't know there was such a simple solution with Jsoup. And yes, with what you have given it also works faster for me. Still, it was a nice experience learning all that stuff. Thanks a lot! – Vasanth Nag K V Nov 16 '13 at 19:23
  • Hi cftygv, you answered my question above and your solution was working fine, but now, when I use it for a different website, I am facing the same problem again. – Vasanth Nag K V Mar 18 '15 at 21:17
3

The problem might be that your web page is rendered by JavaScript, which runs in a browser. JSoup alone can't help you with this, so you may try HtmlUnit, a headless browser that can also be driven through Selenium WebDriver; see the linked question: using Jsoup to sign in and crawl data.

UPDATE

There are several reasons why the HTML is different. The most probable is that this web page contains <script> elements which hold dynamic page logic. This could be an application inside your web page which sends requests to the server and adds or removes content depending on the responses.

JSoup would never render such pages because that is a job for a browser like Chrome, Firefox or IE. JSoup is a lightweight parser for the plain-text HTML you get from the server.
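One rough way to check whether this is what is happening: take a piece of text you can see in the browser and look for it in the raw HTML your Java code downloaded. The sketch below is only a heuristic using the JDK; the class name RawHtmlCheck, its method names and the sample HTML are made up for illustration and are not part of JSoup or any other library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JSoup: a rough check on raw, unrendered HTML.
class RawHtmlCheck {

    // Matches opening <script ...> tags only; "</script>" does not match.
    private static final Pattern SCRIPT_TAG =
            Pattern.compile("<script\\b", Pattern.CASE_INSENSITIVE);

    /** Counts opening script tags in the raw HTML. */
    static int countScriptTags(String rawHtml) {
        Matcher matcher = SCRIPT_TAG.matcher(rawHtml);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    /**
     * Heuristic: text that is visible in the browser but absent from the raw HTML,
     * on a page that loads scripts, was probably added by JavaScript.
     */
    static boolean likelyAddedByScript(String rawHtml, String visibleText) {
        return !rawHtml.contains(visibleText) && countScriptTags(rawHtml) > 0;
    }

    public static void main(String[] args) {
        // Made-up sample: an empty container that a script would fill in a real browser.
        String rawHtml = "<html><body><div id=\"app\"></div>"
                + "<script src=\"app.js\"></script></body></html>";
        System.out.println(countScriptTags(rawHtml));
        System.out.println(likelyAddedByScript(rawHtml, "Latest prices"));
    }
}
```

If the check comes back true for content you need, no amount of header or charset tuning will help, and a real rendering engine is required.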

So what you could do is use a web driver which emulates a web browser and renders a page in memory, so it would have the same content as shown to the user. You may even do mouse clicks with this driver.

The proposed implementation for the web driver in the linked answer is HtmlUnit. It's the most lightweight solution; however, it might give you unexpected results, see: Selenium vs HtmlUnit?

If you want the most real page rendering, you might want to consider Selenium WebDriver.

Community
Andrey Chaschev
  • Hi Andrey, can you please elaborate? I did not understand much from the link you gave me. I have just started using Jsoup, so I am not very familiar with its functionality. Would love to know more. :) – Vasanth Nag K V Nov 13 '13 at 19:00
  • Updated - HtmlUnit now can use WebDriver. – Andrey Chaschev Nov 13 '13 at 19:25
  • Andrey, may I know what this web driver is that you have mentioned? Is it a program that can be used in my Java program and in turn does the job of the web browser for me? Please explain this - "a web driver which emulates a web browser and renders a page in memory" and "mouse clicks with this driver". – Vasanth Nag K V Nov 13 '13 at 19:28
  • There you go: http://www.seleniumhq.org/projects/webdriver/. I don't know very much about the recent details. Basically, WebDriver is an abstraction which allows you to render and navigate pages in memory. It can be a Chromium Driver, a Firefox Driver, etc. – Andrey Chaschev Nov 13 '13 at 19:37
  • Thanks for the help Andrey. Any idea how to use these web drivers in my code? Any APIs? I am guessing that the flow would be somewhat like this: java code -> webDriver -> render page -> get result -> parse with Jsoup. Please correct me if I am wrong. – Vasanth Nag K V Nov 13 '13 at 19:46
  • Yes, you're right. Take a look at example (I haven't tried it!): https://github.com/SeleniumHQ/www.seleniumhq.org/blob/master/src/main/rst/examples/Chapter4/Java/HtmlUnitExample.java. – Andrey Chaschev Nov 13 '13 at 20:01
  • I tried this Andrey, but again no luck, although there was some improvement. I used the driver.getPageSource method, but still the exact thing that we see in the browser's page source didn't come up :( Am I using the wrong API here? – Vasanth Nag K V Nov 13 '13 at 20:54
  • I haven't used Selenium directly for two or three years... Could you post with your new issue as different question so that guys with relevant qualification could see them and help you out? – Andrey Chaschev Nov 13 '13 at 20:57
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/41132/discussion-between-user2781031-and-andrey-chaschev) – Vasanth Nag K V Nov 13 '13 at 21:06
  • Hi Andrey, please take a look at the answer from cftygv in this thread. It has a much simpler solution with Jsoup itself. Just telling you in case you might be interested to know. :) But for all the things you taught me, thank you! – Vasanth Nag K V Nov 16 '13 at 19:26
1

Why do you want to parse a web page this way? If the website offers a consumable service, it might have a REST API.

To answer your question: a webpage viewed in a web browser may not be the same as the same webpage downloaded using a URLConnection.

The following are a few of the reasons that can cause these differences:

  1. Request Headers: when the client (java application/browser) makes a request for a URL, it sets various headers as part of the request and the webserver may change the content of the response accordingly.

  2. JavaScript: once the response is received, any JavaScript elements present in the response are executed by the browser's JavaScript engine, which may change the contents of the DOM.

  3. Browser Plugins, such as IE Browser Helper Objects, Firefox Extensions or Chrome Extensions may change the contents of the DOM.

In simple terms, when you request a URL using a URLConnection you receive raw data, whereas when you request the same URL using a browser's address bar you get a webpage that has been processed (by JavaScript and browser plugins).

URLConnection/JSoup will allow you to set request headers as required, but you may still get a different response due to points 2 & 3. Selenium allows you to remote control a browser and has an API to access the rendered page. Selenium is used for automated testing of web applications.
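As a rough sketch of point 1, and of the charset question in the post, the snippet below uses only the JDK to send browser-like request headers and to decode the response with the charset named in the Content-Type response header, falling back to UTF-8. The class name RawPageFetcher, its method names and the header values are assumptions made for illustration, not code from the question. A page may also declare its charset in a <meta> tag, which this sketch does not inspect.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names and header values are illustrative only.
class RawPageFetcher {

    // Finds "charset=..." inside a Content-Type value such as "text/html; charset=ISO-8859-1".
    private static final Pattern CHARSET_PATTERN =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or UTF-8 if none is usable. */
    static Charset charsetFromContentType(String contentType) {
        if (contentType != null) {
            Matcher matcher = CHARSET_PATTERN.matcher(contentType);
            if (matcher.find()) {
                try {
                    return Charset.forName(matcher.group(1));
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: fall through to the default.
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Downloads the raw (unrendered) page source, sending browser-like headers. */
    static String fetch(String pageUrl) throws IOException {
        URLConnection connection = URI.create(pageUrl).toURL().openConnection();
        connection.setConnectTimeout(10000);
        connection.setReadTimeout(10000);
        // Some servers answer differently when no browser-like User-Agent is sent.
        connection.setRequestProperty("User-Agent", "Mozilla/5.0");
        connection.setRequestProperty("Accept", "text/html");
        connection.setRequestProperty("Accept-Language", "en-US,en;q=0.8");

        try (InputStream in = connection.getInputStream()) {
            // The response headers are available once the stream has been opened.
            Charset charset = charsetFromContentType(connection.getContentType());
            StringBuilder page = new StringBuilder();
            try (Reader reader = new InputStreamReader(in, charset)) {
                char[] buffer = new char[1024];
                int charsRead;
                while ((charsRead = reader.read(buffer)) != -1) {
                    page.append(buffer, 0, charsRead);
                }
            }
            return page.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java RawPageFetcher <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

This only narrows the gap caused by request headers and encoding. Content added by JavaScript or browser plugins will still be missing from the result.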

Community
vasanth
  • Hi vasanth, please follow my comments on the answer just below yours, where I have talked to Andrey about the problem I am facing. I started with Selenium too, but still no luck. Thanks a lot for turning up on my question. – Vasanth Nag K V Nov 13 '13 at 21:05