How can I browse to another page on the same website and extract data from it without authenticating again, using Python and urllib2?

See the code below. I open the first page, http://xx.xx.xx.xx:8080/status, and after authenticating I get what I need from it. I then attempt to open a second page, http://xx.xx.xx.xx:8080/uistatus.html, but execution jumps to the exception clause and prints:

Unexpected error HTTP Error 401: Unauthorized

code:

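import re
import time
import urllib2

from bs4 import BeautifulSoup  # assuming BeautifulSoup 4, since the "lxml" parser is requested
from lxml import etree

try:

        pattern = r'\s*Current\s+stream\s+number:\s*(\d+)'
        pattern2 = r'\s*Reconnects:\s*(\d+)'
        SERVER = 'http://xx.xx.xx.xx:8080/status'
        # Credentials are registered for the /status URL only
        authinfo = urllib2.HTTPPasswordMgrWithDefaultRealm()
        authinfo.add_password(None, SERVER, 'xxxxxx', 'xxxxxxx')
        page = 'http://xx.xx.xx.xx:8080/status'
        handler = urllib2.HTTPBasicAuthHandler(authinfo)
        myopener = urllib2.build_opener(handler)
        # install_opener() returns None; it makes urllib2.urlopen use myopener
        urllib2.install_opener(myopener)
        output = urllib2.urlopen(page)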
        #print output.read()
        soup = BeautifulSoup(output.read(), "lxml")
        #print(soup)

        paragraphs = soup.findAll('p')
        data = []
        for para in paragraphs:
                found = re.finditer(pattern, para.text, re.IGNORECASE)
                data.extend([x.group(1) for x in found])


        #print data
        print "exstreamer 1 status: ", int(data[0])
        if int(data[0]) == 1:
                mesg = "Centerpoint exstreamer connected to main streaming host"
                centerpoint_online = "Online"
                centerpoint_connection = "Main"    

        elif int(data[0]) == 2 or int(data[0]) == 3:
                mesg = "Centerpoint exstreamer connected to local qkradio instreamer"
                print 'alert sent', mesg
                with open("/var/www/html/status.log", "a") as myfile:
                    myfile.write(time.strftime("%Y-%m-%d %H:%M")+ "\t Centerpoint exstreamer connected to local qkradio instreamer\n")  
                centerpoint_connection = "Backup"
                system_ok = "Offline"
        data = []
        for para in paragraphs:
                found = re.finditer(pattern2, para.text, re.IGNORECASE)
                data.extend([x.group(1) for x in found])
        centerpoint_reconnect_number_old = centerpoint_reconnect_number
        centerpoint_reconnect_number = int(data[0])
        print "Centerpoint number of reconnects: ", centerpoint_reconnect_number
        if not centerpoint_reconnect_number == centerpoint_reconnect_number_old:
            centerpoint_stream_stable = "Disconnected/ Reconnected to Stream"
            mesg = "Centerpoint exstreamer disconnect/reconnect, possible buffering issues"
            print 'alert sent', mesg
            with open("/var/www/html/status.log", "a") as myfile:
                    myfile.write(time.strftime("%Y-%m-%d %H:%M")+ "\t Centerpoint exstreamer disconnect/reconnect, possible buffering issues\n") 
        else:
            centerpoint_stream_stable = "system ok" 


        page = 'http://xx.xx.xx.xx:8080/uistatus.html'
        output = urllib2.urlopen(page)
        htmlparser = etree.HTMLParser()
        tree = etree.parse(output, htmlparser)
        #print tree.xpath("/html/body/table/tr[3]/th[2]/font/text()")
        print tree.xpath("//th/font[@color]/text()")
        centerpoint_stream_status = tree.xpath("//th/font[@color]/text()")

        # xpath() returns a list of strings, so compare against a list by value
        if centerpoint_stream_status == ['IDLE']:
            mesg = "Centerpoint exstreamer source IDLE"
            print 'alert sent', mesg
            with open("/var/www/html/status.log", "a") as myfile:
                    myfile.write(time.strftime("%Y-%m-%d %H:%M")+ "\t Centerpoint exstreamer source IDLE\n") 



except urllib2.URLError:
        print "Internet dropped, or error"
        mesg = "Centerpoint exstreamer unreachable"
        print 'alert sent', mesg
        i_centerpoint_online = centerpoint_online + 1
        if centerpoint_online == 3:
            centerpoint_online = 0
            with open("/var/www/html/status.log", "a") as myfile:
                        myfile.write(time.strftime("%Y-%m-%d %H:%M")+ "\t Centerpoint exstreamer unreachable\n")  
            centerpoint_online = "Offline"
            system_ok = "Offline"

except Exception, err:
    print "Unexpected error", err
    centerpoint_online = "Offline"
    system_ok = "Offline"
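
A possible cause, assuming the device uses HTTP Basic auth: the password manager hands out credentials only for URLs that fall under the URI passed to add_password. The sketch below makes no network calls and only asks two password managers what they would return for each page. The names BASE, narrow and broad and the 'user'/'secret' values are placeholders for illustration, and the import fallback is there only so the same lines also run on Python 3.

```python
try:
    import urllib2  # Python 2, as in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent module

BASE = 'http://xx.xx.xx.xx:8080/'  # placeholder host from the question

# Credentials registered for the /status page only, as in the question
narrow = urllib2.HTTPPasswordMgrWithDefaultRealm()
narrow.add_password(None, BASE + 'status', 'user', 'secret')

# Credentials registered for the server root instead (assumed alternative)
broad = urllib2.HTTPPasswordMgrWithDefaultRealm()
broad.add_password(None, BASE, 'user', 'secret')

# Ask each manager which credentials it would offer for each page
narrow_status = narrow.find_user_password(None, BASE + 'status')
narrow_ui = narrow.find_user_password(None, BASE + 'uistatus.html')
broad_ui = broad.find_user_password(None, BASE + 'uistatus.html')

print(narrow_status)
print(narrow_ui)
print(broad_ui)
```

If that is what is happening here, passing the server root to add_password instead of the /status URL may let the installed opener answer the 401 challenge for both pages. This does not apply if the device uses cookies or some other scheme.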
Ossama
  • The answer depends on the authentication scheme of the site at which you're looking. Often, auth cookies are sent when a user authenticates, but there's no way to answer this question without knowing how the user is meant to provide their credentials after authentication. Your most recent edit indicates that you are not sending auth credentials (which would have been returned with the first response) in the second request. – Matt Morgan Jan 10 '18 at 19:59
  • How can I get you this information, please? – Ossama Jan 10 '18 at 20:03
  • I get the following error: Unexpected error HTTP Error 401: Unauthorized – Ossama Jan 10 '18 at 20:04
  • Do you have documentation on the website you're scraping? That would be one place to look. The server may be sending back an auth token/cookie, so you could also inspect the properties of the initial response. A debugger would be ideal for this task. https://stackoverflow.com/questions/25385173/what-is-a-debugger-and-how-can-it-help-me-diagnose-problems – Matt Morgan Jan 10 '18 at 20:06
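
As a rough sketch of the inspection suggested in the comments, the helper below picks out the response headers that usually reveal the auth scheme. The names auth_hints and AUTH_HEADERS are made up for illustration, and the header list is an assumption about what this device might send, not something confirmed for it.

```python
# Response headers that commonly indicate how a server expects authentication
AUTH_HEADERS = ('WWW-Authenticate', 'Set-Cookie')


def auth_hints(headers):
    """Return the auth-related entries found in a dict-like header object."""
    hints = {}
    for name in AUTH_HEADERS:
        value = headers.get(name)
        if value is not None:
            hints[name] = value
    return hints


# Intended use with the question's code (not executed here):
#   output = urllib2.urlopen(page)
#   print(auth_hints(output.info()))      # headers of a successful response
#   ...
#   except urllib2.HTTPError as err:
#       print(auth_hints(err.info()))     # headers of the 401 response

# Stand-in headers so the sketch runs without a network connection
sample = {'WWW-Authenticate': 'Basic realm="status"', 'Content-Type': 'text/html'}
print(auth_hints(sample))
```

A WWW-Authenticate value starting with Basic on the 401 response would point to Basic auth, while a Set-Cookie on the first response would suggest a session cookie that needs to be sent back on later requests.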

0 Answers