My company wants me to pull data from its internal website, organize it, and load it into a database. The data is displayed in tables on pages within the site. I want to pull the fields into a file or into memory for further processing.

So far, I can log into the site from PowerShell by filling in my username and password and clicking the login button, which I locate by its ID. I can then use the Navigate method to move to the appropriate page within the site. However, running Invoke-WebRequest or Net.WebClient against the new page returns the content of the site's login screen instead. (I know this because nothing from the table appears in the returned values, regardless of the commands I use.) The commented-out code is what I've tried previously.

Here is the code, minus the real values of my ID, password, and site link:

[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
$ie = New-Object -ComObject 'internetExplorer.Application'
$ie.Visible= $true # Make it visible
$username="myid"
$password="mypw"
$ie.Navigate("https://webpage.com/index.jsp")
While ($ie.Busy -eq $true) {Start-Sleep -Seconds 3;}
$usernamefield = $ie.document.getElementByID('login')
$usernamefield.value = "$username"
$passwordfield = $ie.document.getElementByID('password')
$passwordfield.value = "$password"
$Link = $ie.document.getElementByID('SubmitLogin')
$Link.click()
$url = "https://webpage.com/home.pa#%5BT1%2CM181%5D"
$ie.Navigate($url) 
While ($ie.Busy -eq $true) {Start-Sleep -Seconds 3;}
$doc = $ie.document
$web = New-Object Net.WebClient
$web.DownloadString($url)
#$r = Invoke-WebRequest $url
#$r.Forms.fields | get-member
#$InnerText = $r.AllElements | 
#    Where-Object {$_.tagName -ne "TD" -and $_.innerText -ne $null} | 
#    Select -ExpandProperty innerText
#write-host $InnerText
#$r.AllElements|Where-Object {$_.InnerHtml -like "*=*"} 

#$doc = $ie.Document
#$doc.getElementByID("ext-element-7") | % {
#    if ($_.id -ne $null){
#        write-host $_.id
#    }
#}
$ie.Quit()
King of NES
  • Have you looked at the answers to [this](https://stackoverflow.com/questions/25940510/how-to-extract-specific-tables-from-html-file-using-native-powershell-commands) question? – TheMadTechnician Sep 17 '18 at 19:56
  • I just looked at that question, and I'm not able to pull any data using the methods/code listed. I keep getting a null-valued expression error, so I think something is still missing. – King of NES Sep 17 '18 at 20:18
  • Have you tried not using the IE com object? – Maximilian Burszley Sep 17 '18 at 23:06

2 Answers

I don't have access to your page, so I can't confirm that the body of the sign-in POST contains the fields login and password; that will take some trial and error on your part. To find out, open your browser's developer tools, switch to the Network tab, and filter by POST to see how the login page signs you in. For example, when I sign in to reddit, the browser sends a POST to https://www.reddit.com/login with a body containing username and password key/value pairs (both plaintext). That request is what sets up the browser session that keeps me logged in.


Here's a code example that uses the HtmlAgilityPack library to interact with the resulting page as if it were XML.

Enabling TLS 1.2:

[System.Net.ServicePointManager]::SecurityProtocol =
    [System.Net.ServicePointManager]::SecurityProtocol -bor [System.Net.SecurityProtocolType]::Tls12

Setting up your web session:

$iwrParams = @{
    'Uri'             = 'https://webpage.com/index.jsp'
    'Method'          = 'POST'
    'Body'            = @{
        'login'    = $username
        'password' = $password
    }
    'SessionVariable' = 'session'
    # avoids cases where IE has not been opened
    'UseBasicParsing' = $true
}
# don't care about response - only here to initialize the session
$null = Invoke-WebRequest @iwrParams

Getting the protected page content:

$iwrParams = @{
    'Uri'             = 'https://webpage.com/home.pa#%5BT1%2CM181%5D'
    'WebSession'      = $session
    'UseBasicParsing' = $true
}
$output = (Invoke-WebRequest @iwrParams).Content
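
Before parsing, it can help to confirm that the request actually returned the protected page rather than the login form. The sketch below is one hedged way to check that. `Test-IsLoginPage` is a hypothetical helper name, and it assumes the login form uses the `login` and `password` element IDs from the question's script; adjust them for your site.

```powershell
# Hypothetical helper: returns $true when the HTML still looks like the login form.
# Assumes the login page contains elements with id="login" and id="password",
# as in the question's script.
function Test-IsLoginPage {
    param([Parameter(Mandatory)][AllowEmptyString()][string]$Html)

    # -match is case-insensitive; accept single or double quotes around the id value
    $hasLogin    = $Html -match 'id\s*=\s*["'']login["'']'
    $hasPassword = $Html -match 'id\s*=\s*["'']password["'']'
    return ($hasLogin -and $hasPassword)
}

# Usage with the $output variable from the request above:
# if (Test-IsLoginPage -Html $output) {
#     Write-Warning 'Still on the login page; check the POST body and the session.'
# }
```

If the check reports the login page, the POST body field names or the login URL are the first things to compare against what the browser's Network tab shows.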

Downloading/adding HtmlAgility:

if (-not (Test-Path -Path "$PSScriptRoot\HtmlAgilityPack.dll" -PathType Leaf))
{
    Invoke-WebRequest -Uri https://www.nuget.org/api/v2/package/HtmlAgilityPack -OutFile "$PSScriptRoot\html.zip"
    Expand-Archive -Path "$PSScriptRoot\html.zip" -DestinationPath "$PSScriptRoot\html" -Force
    Copy-Item -Path "$PSScriptRoot\html\lib\netstandard2.0\HtmlAgilityPack.dll" -Destination "$PSScriptRoot\"
    Remove-Item -Path "$PSScriptRoot\html", "$PSScriptRoot\html.zip" -Recurse -Force
}

Add-Type -Path "$PSScriptRoot\HtmlAgilityPack.dll"
$html = [HtmlAgilityPack.HtmlDocument]::new()

Loading/parsing your page content:

$html.LoadHtml($output)

# do stuff with output.
$html.DocumentNode.SelectNodes('//*/text()').Text.Where{$PSItem -like '*=*'}

Footnote

The code assumes you are running it from a script, where $PSScriptRoot is populated. If you run it interactively, use the $pwd automatic variable instead (a carry-over from *nix: print working directory). This code requires PSv5+.
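
A minimal sketch of that fallback, assuming you want one variable that works both in a script and at the console. `$baseDir` and `$dllPath` are names introduced here for illustration and are not part of the code above.

```powershell
# $PSScriptRoot is empty when the code is pasted into an interactive console,
# so fall back to the current working directory in that case.
$baseDir = if ($PSScriptRoot) { $PSScriptRoot } else { $PWD.Path }

# Build the DLL path from whichever directory was chosen.
$dllPath = Join-Path -Path $baseDir -ChildPath 'HtmlAgilityPack.dll'
```

You could then use `$baseDir` wherever the code above uses `$PSScriptRoot`.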

Maximilian Burszley
  • Forgive me, your answer is excellent, but I can log into the website just fine from the script. I even see the window change to the new page. I just can't seem to pull the data from that page – King of NES Sep 18 '18 at 14:32
  • @KingofNES I understand that, but using this method, you *can* pull information off the page *and* you remove your dependency on IE & com objects. – Maximilian Burszley Sep 18 '18 at 14:39

After some serious effort, I managed to get the pages to work correctly. It turns out I wasn't waiting for everything to load. Once I fixed that, I eventually found the correct tag and class name to make everything work.
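
One hedged way to wait more reliably is to poll both `Busy` and `ReadyState` (4 means the document has finished loading) with a timeout, instead of checking `Busy` alone. `Wait-IEReady` is a hypothetical helper name, and the timeout and polling interval are arbitrary choices. Pages that fill in their content with scripts after loading may still need an extra check for the specific element.

```powershell
# Hypothetical helper: wait until the browser object reports it has finished loading.
# Returns $true when ready, $false if the timeout expires first.
function Wait-IEReady {
    param(
        [Parameter(Mandatory)]$Browser,
        [int]$TimeoutSeconds = 30,
        [int]$PollMilliseconds = 250
    )

    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)

    # ReadyState 4 is READYSTATE_COMPLETE on the InternetExplorer.Application object
    while ($Browser.Busy -or $Browser.ReadyState -ne 4) {
        if ((Get-Date) -gt $deadline) { return $false }
        Start-Sleep -Milliseconds $PollMilliseconds
    }
    return $true
}

# Usage with the $ie object from the question:
# $ie.Navigate($url)
# if (-not (Wait-IEReady -Browser $ie)) { Write-Warning 'Page did not finish loading in time.' }
```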

Assuming the code in the original post is correct up to "$ie.Navigate($url)":

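$ie.Navigate($url)

# Wait until IE reports that it is no longer busy
While ($ie.Busy -eq $true) {Start-Sleep -Seconds 3;}
# Note: $r is not used below; the data is read from the IE document instead
$r = Invoke-WebRequest $url
$doc = $ie.document
# Read the text of the <body> element whose class matches the area you want
$j = ($doc.getElementsByTagName("body") | Where-Object {$_.className -eq 'thefullclassname found in the quotes of <body class="" of the area you wanted'}).innerText
Write-Host $j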

This gave me the output of a very annoyingly built table that isn't a real "table" and has the first row/column on its own, so formatting the output into an easy-to-use version will be the next hassle. At least I got everything on the page that had the text I needed, so that's progress!
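
As a possible starting point for that formatting step, the sketch below splits the raw text into trimmed, non-empty lines that can be filtered or exported. It makes no assumption about how the cells are laid out within each line, since that isn't shown here. `ConvertTo-TextLine` is a hypothetical helper name, and `page.csv` is a placeholder path.

```powershell
# Hypothetical helper: split raw page text into trimmed, non-empty lines,
# numbering each line that is kept.
function ConvertTo-TextLine {
    param([Parameter(Mandatory)][AllowEmptyString()][string]$Text)

    $n = 0
    foreach ($line in ($Text -split '\r?\n')) {
        $trimmed = $line.Trim()
        if ($trimmed) {
            $n++
            [pscustomobject]@{ LineNumber = $n; Text = $trimmed }
        }
    }
}

# Usage with the $j variable from the code above:
# ConvertTo-TextLine -Text $j | Export-Csv -Path 'page.csv' -NoTypeInformation
```

Once the line layout is known, each `Text` value could be split further into columns before export.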

King of NES