I have just started using Selenium WebDriver and I am stuck on a problem: I want to download a web page's source into my Java program. I tried driver.getPageSource() with the HtmlUnit driver, but the result does not exactly match what I get when I do the following manually:

Right-click in the browser -> View Page Source.

I cannot figure out what the problem is. Is there a different API for this purpose, or am I using the wrong driver? Should I use the Chrome driver instead of the HtmlUnit driver? If so, how do I use the Chrome driver?

Here is what I am doing:

    WebDriver driver = new HtmlUnitDriver();
    driver.get(webPage);
    System.out.println(driver.getPageSource());
alex
Vasanth Nag K V
  • What do you mean by "does not exactly match the result"? What's different? Is it giving you the source of another page? Is there JavaScript on the page altering the content? I'm curious whether that would affect this. – Avery Nov 13 '13 at 21:29
  • Some elements are missing. For example, SHANKAR HARDWARE does not come up with the method I used in the question, but when I do View Page Source, it shows up. – Vasanth Nag K V Nov 14 '13 at 04:37

2 Answers

I've just checked out Fluent Selenium, which here drives the Firefox WebDriver. It's a testing framework, so don't be surprised by the assertion methods, but it can also be used for crawling. It worked perfectly for me with very little configuration. I use Maven to fetch its dependencies; here is my working example:

package fluent;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.seleniumhq.selenium.fluent.FluentWebDriver;
import org.seleniumhq.selenium.fluent.Period;
import org.seleniumhq.selenium.fluent.TestableString;

import java.util.concurrent.TimeUnit;

import static org.openqa.selenium.By.className;

public class Test {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        FluentWebDriver fwd = new FluentWebDriver(driver);

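        // Wait up to 5 seconds for elements to appear before a lookup fails.
        driver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);
        try {
            driver.get("http://www.hudku.com/search/business-list/Paint%20%26%20Hardware%20in%20Kanakapura%20Road,%20Bangalore,%20Karnataka,%20India?p=6&h1=mgK%3DFsPlSAsPTaOVwo%2F0FIMA");

            // Read the heading text, retrying for up to 3 seconds while the page renders.
            TestableString test = fwd.div(className("heading")).within(Period.secs(3)).getText();

            System.out.println("header: " + test.toString());

            // Assert that the heading mentions "Paint".
            test.shouldContain("Paint");

            System.out.println("all is fine!");
        } finally {
            // Close the browser and end the WebDriver session.
            driver.quit();
        }
    }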
}

My pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>testPrj3</groupId>
    <artifactId>testPrj3</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.seleniumhq.selenium.fluent</groupId>
            <artifactId>fluent-selenium</artifactId>
            <version>1.14.2</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.hamcrest</groupId>
            <artifactId>hamcrest-all</artifactId>
            <version>1.3</version>
            <scope>test</scope>
        </dependency>

        <!-- If you're needing Coda Hale's Metrics integration (optional) -->
        <dependency>
            <groupId>com.codahale.metrics</groupId>
            <artifactId>metrics-core</artifactId>
            <version>3.0.0</version>
        </dependency>

    </dependencies>


    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

UPDATE

FluentLenium seems to be a little more popular.

Andrey Chaschev
  • Hi Andrey! We have already met, I guess; nice to see you back again. Maven? I have never used it. Can you please shed some light on it and on the script you posted? – Vasanth Nag K V Nov 14 '13 at 10:05
  • Hey. Maven is one of the most popular build tools for Java, and there is a lot of information about it on the web. In this example you would need it to fetch a dozen JARs from the web and configure your classpath automatically at run time. I recommend you start by creating a new Maven project. IntelliJ works perfectly for me: [Creating and importing Maven projects](http://wiki.jetbrains.net/intellij/Creating_and_importing_Maven_projects). This might seem like a long path, but I'm not aware of any shorter one... – Andrey Chaschev Nov 14 '13 at 10:29
  • Hi Andrey, sorry that I am getting back to you so late. I was doing some research on what Maven is, and now I know a bit about it. All I want to ask is: why did we use Maven for the code above? Will it not work if I simply run the Java class in Eclipse without building it into a JAR? (I mean, Maven is used to build a JAR or equivalent.) – Vasanth Nag K V Nov 15 '13 at 20:19
  • Hey, you can avoid using Maven. I just expect some pain in that case with the list of required JARs. If it's easier for you, just download them manually. – Andrey Chaschev Nov 15 '13 at 20:36
  • Oh, I see. But I have already configured Maven now, so it's fine; I learnt something new. Andrey, can you please do me a favor? Can you open this URL - "http://www.justdial.com/Bangalore/Tape-Dealers-%3Cnear%3E-Bangalore-City-Railway-Station/ct-12976/page-5" - download the page source using the program you posted above, and then mail it to me, so that I can see whether this will really work for my case? It would be an immense help if you could do that for me. I know I am asking too much, but please... rocker.vasanth@gmail.com – Vasanth Nag K V Nov 15 '13 at 20:59
  • Awesome, Andrey! Thanks a lot for the help. The result you mailed me was exactly what I needed! I am not using Maven now; I got the required JARs, but I still get an error when I try to run the code you posted above: **Could not start a new session. Possible causes are invalid address of the remote server or browser start-up failure.** I have googled all around and am still not able to solve it. :( What am I doing wrong here? I didn't use Maven, though. Any more advice on this, please? – Vasanth Nag K V Nov 16 '13 at 09:09

The problem is that the browser sends a string to the web server declaring what type of browser it is (the User-Agent header), and the web page can then give you different content depending on the browser. This is a basic fact of web programming: developers often have to adjust page content, especially CSS declarations, depending on the browser.
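
To make that mechanism concrete, here is a minimal sketch that uses only the JDK (Java 11 or later), not Selenium. It starts a throwaway local server that answers differently depending on the User-Agent header, then fetches the same URL twice with two different User-Agent strings. The class name `UserAgentDemo`, the helpers `startServer` and `fetch`, the User-Agent strings, and the page bodies are all made up for illustration; whether the site in the question actually behaves this way is an assumption you would have to check.

```java
import com.sun.net.httpserver.HttpServer;

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class UserAgentDemo {

    /** Starts a local server whose response depends on the User-Agent header (hypothetical site). */
    static HttpServer startServer() throws IOException {
        // Port 0 asks the operating system for any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", 0), 0);
        server.createContext("/", exchange -> {
            String userAgent = exchange.getRequestHeaders().getFirst("User-Agent");
            String body = (userAgent != null && userAgent.contains("Chrome"))
                    ? "<html><body>full page</body></html>"
                    : "<html><body>reduced page</body></html>";
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, bytes.length);
            exchange.getResponseBody().write(bytes);
            exchange.close();
        });
        server.start();
        return server;
    }

    /** Fetches the raw HTML of a URL while announcing the given User-Agent. */
    static String fetch(String url, String userAgent) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = startServer();
        try {
            String url = "http://localhost:" + server.getAddress().getPort() + "/";
            // Same URL, two different browser identities, two different sources.
            System.out.println(fetch(url, "Mozilla/5.0 Chrome/120.0"));
            System.out.println(fetch(url, "ExampleBot/1.0"));
        } finally {
            server.stop(0);
        }
    }
}
```

If a real site does this, the source returned to a driver depends on which browser the driver claims to be, so comparing the output of two drivers against the same URL is a quick way to test the theory.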

djangofan
  • Hi djangofan, thanks for turning up on my question. The expected result appears correctly in Google Chrome, so are you saying that I have to use the Chrome driver API to download the source code of the web page? – Vasanth Nag K V Nov 14 '13 at 07:48