How to web scrape after login

Question

I am able to log in to the webpage just fine and navigate to a page inside the website, "urlCanada". However, when I try to load that page into htmlCanada and debug it, it shows me the HTML of the login screen instead of the HTML of the navigated page. Am I missing something? Why would htmlCanada contain the login page if I told it to GetStringAsync from the navigated page?

        var urlCanada = webBrowserCanada.Url;
        //Creates a client for you to store the webpage in
        var httpClientCanada = new HttpClient();
        var htmlCanada = await httpClientCanada.GetStringAsync(urlCanada);
        //Allows parsing the information out
        var htmlDocumentCanada = new HtmlAgilityPack.HtmlDocument();
        htmlDocumentCanada.LoadHtml(htmlCanada);
        //Parse the information
        var ProductsHtml = htmlDocumentCanada.DocumentNode
            .SelectSingleNode("//table[@id='tableid']")
            .Descendants("tr")
            .Skip(1)
            .Where(tr => tr.Elements("td").Count() > 1)
            .Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
            .ToList();

This is the HTML of the table:

<table class="GridViewMFG" rules="all" id="ctl00_mainContent_GridViewIssuedParts" style="width:100%;border-collapse:collapse;" cellspacing="0" cellpadding="4" border="1">
</table>

P.S. When I debug and look at webBrowserCanada.Url, it shows the URL of the navigated webpage.

Where are you logging into the site? Your code doesn't show that piece. GetStringAsync on the URL just gets the HTML returned from the URL provided, which I'm assuming is the login URL? — Daniel, Feb 13 '19 at 15:56
There's no login code in this snippet which means the server sees just an anonymous request that needs to be logged in. *Don't* use a new HttpClient in the first place. That class is thread-safe and meant to be reused. If the web server uses cookies, you need to keep them from one request to the next. If you create a *new* client each time, the cookies are lost — Panagiotis Kanavos, Feb 13 '19 at 15:56
You have to create an HttpClientHandler with a CookieContainer [as shown here](https://stackoverflow.com/questions/12373738/how-do-i-set-a-cookie-on-httpclients-httprequestmessage) — Panagiotis Kanavos, Feb 13 '19 at 15:58
@Daniel The urlCanada is the logged in part. I cannot show you the login part as this is a company website. urlCanada is the navigated to url after login. Sorry I cannot show more. — Mr.Finch, Feb 13 '19 at 16:00
@Mr.Finch are you saying that after logging in, a session ID or something is added to the URL? The common and secure way is to use authentication cookies, not modified URLs — Panagiotis Kanavos, Feb 13 '19 at 16:24
@PanagiotisKanavos nothing is added to the URL; it just somehow gets read as the initial login screen. It doesn't make any sense as to why, since that isn't the URL I am currently on. I am very unfamiliar with cookies but I do believe that solution would work. — Mr.Finch, Feb 13 '19 at 17:25
@Mr.Finch it makes perfect sense. When you login, the server sets a cookie on the browser. When you try to connect using a *different* browser you'll get the login page again. That's what happened here. You logged in using a WebBrowser control but then tried to access that page using a *different* class that doesn't have the authentication cookie — Panagiotis Kanavos, Feb 13 '19 at 17:29
@PanagiotisKanavos that actually makes perfect sense now. I didn't realize making a new HttpClient was basically a new browser. Thank you for the explanation — Mr.Finch, Feb 13 '19 at 17:33
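
For reference, the cookie approach suggested in the comments could look roughly like the sketch below. It is only an illustration under assumptions: the class and method names, the login and page URLs passed in, and the form field names "username" and "password" are all hypothetical, since the real login form is not shown in the question. An ASP.NET WebForms site will usually also expect hidden fields such as __VIEWSTATE to be posted back, which this sketch does not handle.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class AuthenticatedScraper
{
    // One cookie container and one client for the whole session, so cookies
    // set by the login response are sent with every later request.
    public static readonly CookieContainer Cookies = new CookieContainer();

    private static readonly HttpClient Client = new HttpClient(
        new HttpClientHandler { CookieContainer = Cookies });

    public static async Task<string> LoginAndFetchAsync(
        Uri loginUrl, Uri pageUrl, string user, string password)
    {
        // Hypothetical field names; use the ones from the real login form.
        var form = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            ["username"] = user,
            ["password"] = password,
        });

        // The handler stores any Set-Cookie headers from this response.
        var loginResponse = await Client.PostAsync(loginUrl, form);
        loginResponse.EnsureSuccessStatusCode();

        // Same client, same cookie container: the stored cookies go along.
        return await Client.GetStringAsync(pageUrl);
    }
}
```

The point of the design is that the client is created once and reused. Creating a new HttpClient for the second request, as in the question, starts with an empty cookie store, so the server answers with the login page.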

score 0 · Answer 1 · answered Feb 13 '19 at 17:29

So I was able to find an easy workaround. Since webBrowserCanada had already loaded the page I needed, I removed these two lines of code:

        var httpClientCanada = new HttpClient();
        var htmlCanada = await httpClientCanada.GetStringAsync(urlCanada);

And replaced it with

        var htmlCanada = webBrowserCanada.DocumentText;

So now the whole code reads

        var htmlCanada = webBrowserCanada.DocumentText;
        //Allows parsing the information out
        var htmlDocumentCanada = new HtmlAgilityPack.HtmlDocument();
        htmlDocumentCanada.LoadHtml(htmlCanada);
        //Parse the information
        var ProductsHtml = htmlDocumentCanada.DocumentNode
            .SelectSingleNode("//table[@id='tableid']")
            .Descendants("tr")
            .Skip(1)
            .Where(tr => tr.Elements("td").Count() > 1)
            .Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
            .ToList();
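
One thing worth guarding against: SelectSingleNode returns null when no node matches, so if the HTML is not the expected page (for example, the login screen), the chained call above throws a NullReferenceException. A minimal sketch of the same parsing with a null check is below. The TableScraper class and ExtractRows method are hypothetical names, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableScraper
{
    // Returns the trimmed cell text of each data row, or an empty list when
    // the table is not in the HTML (for example, when it is a login page).
    public static List<List<string>> ExtractRows(string html, string tableId)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        var table = document.DocumentNode
            .SelectSingleNode($"//table[@id='{tableId}']");
        if (table == null)
        {
            return new List<List<string>>();
        }

        return table.Descendants("tr")
            .Skip(1) // skip the first (header) row
            .Where(tr => tr.Elements("td").Count() > 1)
            .Select(tr => tr.Elements("td")
                .Select(td => td.InnerText.Trim())
                .ToList())
            .ToList();
    }
}
```

With the table shown in the question, the call would be something like TableScraper.ExtractRows(webBrowserCanada.DocumentText, "ctl00_mainContent_GridViewIssuedParts"). An empty result is then a hint that the page was not the one expected.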

Why did you use HttpClient at all then, when the *text* that you wanted was already available? — Panagiotis Kanavos, Feb 13 '19 at 17:30
@PanagiotisKanavos I have only done a few web scraping projects. I thought that was what you had to do because that is the way it was explained in the tutorial. I am not sure why it is even used at all normally. — Mr.Finch, Feb 13 '19 at 17:31
@PanagiotisKanavos Sorry for the waste of time it was just me not understanding how web clients work. — Mr.Finch, Feb 13 '19 at 17:32
