0

I am writing a small program to fetch all hyperlinks from a webpage given its URL, but the network I am on seems to use a proxy and the program is not able to fetch anything. My code:

import sys
import urllib
import urlparse

from bs4 import BeautifulSoup


def process(url):
    # Download the page and parse the HTML
    page = urllib.urlopen(url)
    text = page.read()
    page.close()
    soup = BeautifulSoup(text)
    # Print every link as an absolute URL and also save it to s.txt
    with open('s.txt', 'w') as file:
        for tag in soup.findAll('a', href=True):
            tag['href'] = urlparse.urljoin(url, tag['href'])
            print tag['href']
            file.write('\n')
            file.write(tag['href'])


def main():
    if len(sys.argv) == 1:
        print 'No url !!'
        sys.exit(1)
    for url in sys.argv[1:]:
        process(url)


# Without this call the script defines main() but never runs it
if __name__ == '__main__':
    main()
blueteeth
  • Based on your question your network may or may not have a proxy in use. Can you be a little more specific or just pass by your admins and ask? – frlan Sep 22 '15 at 08:59
  • Yes, it has a proxy. I tried it at home and it worked fine, but when I took it to my department to show my teacher it didn't work. This is the error: `IOError: [Errno socket error] [Errno -2] Name or service not known` – Shailang Kharsati Sep 22 '15 at 11:05
  • This is the proxy I use to connect: "proxy4.nehu.ac.in:3128". How do I put it in the code of my program? Please help, I am so stuck with it. – Shailang Kharsati Sep 22 '15 at 11:22
  • OK, I will check on this and come back to you if I run into a problem. At the moment I cannot test it because I have to try it at the university itself, since I don't have a proxy network to test on. Is that OK with you? – Shailang Kharsati Sep 22 '15 at 11:56
  • You can easily set up a proxy on your own. E.g. squid is quite popular. – frlan Sep 22 '15 at 11:57
  • OK, I didn't know. Thanks, I will surely take a look. Thanks for your time, cheers – Shailang Kharsati Sep 22 '15 at 12:01

2 Answers

3

You could use the requests module instead.

import requests

proxies = { 'http': 'http://host/' } 
# or if it requires authentication 'http://user:pass@host/' instead

r = requests.get(url, proxies=proxies)
text = r.text
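The snippet above only maps the 'http' scheme, and requests picks the proxy by the scheme of the URL being fetched, so https URLs would not go through it. Here is a minimal sketch that covers both schemes. The helper names `build_proxies` and `fetch` are made up for illustration, it is written for Python 3, and it assumes the same proxy also handles HTTPS traffic, which you would need to confirm with your network admins.

```python
from urllib.parse import quote


def build_proxies(host, port, user=None, password=None):
    """Build a requests-style proxies dict that covers both http and https URLs."""
    auth = ""
    if user is not None and password is not None:
        # Percent-encode credentials so characters like '@' or ':' do not break the URL.
        auth = "{}:{}@".format(quote(user, safe=""), quote(password, safe=""))
    proxy_url = "http://{}{}:{}".format(auth, host, port)
    return {"http": proxy_url, "https": proxy_url}


def fetch(url, proxies):
    """Fetch url through the given proxies and return the page text."""
    # Imported here so build_proxies() stays usable where requests is not installed.
    import requests

    r = requests.get(url, proxies=proxies, timeout=10)
    r.raise_for_status()  # fail loudly on 4xx/5xx instead of returning an error page
    return r.text
```

With the proxy from the question this would be called as `text = fetch("http://www.dota2.com", build_proxies("proxy4.nehu.ac.in", 3128))`, and `text` can then be passed to BeautifulSoup as before.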
blueteeth
  • Should I put it this way: `proxies = { 'http': 'http://proxya4.nehu.ac.in }`? – Shailang Kharsati Sep 22 '15 at 11:52
  • You need the port and closing quote. So it would be `proxies = { 'http': 'http://proxya4.nehu.ac.in:3128' }` – blueteeth Sep 22 '15 at 11:59
  • Can I come back to you later? I will try it first and let you know how it goes. I really want this to work; I'm like crying inside so bad. – Shailang Kharsati Sep 22 '15 at 12:03
  • Hi, I tried your suggestion and got 'response 200' when I print `r=requests.get("http://www.dota2.com",proxies=poxies)`. What does it mean? – Shailang Kharsati Sep 23 '15 at 09:04
  • 200 is the status code for the response. It is saying the response was ok. [1] To get the html from the page, you need to print `r.text`. [1]: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html – blueteeth Sep 29 '15 at 21:45
  • It's working, thanks a lot. Sorry for the late reply. – Shailang Kharsati Sep 30 '15 at 05:45
  • I thought this was working for me, but when I passed random information in with proxies, data was still retrieved each time (as long as https was used). – Phillip Aug 31 '16 at 20:51
1

The urllib library you are using for HTTP access does not support proxy authentication (it does support un-authenticated proxies). From the docs:

Proxies which require authentication for use are not currently supported; this is considered an implementation limitation.

I suggest you switch to urllib2 and use it as demonstrated in the answer to this post.
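As a rough sketch of the urllib2 route: urllib2 provides `ProxyHandler` and `build_opener`, and in Python 3 the same classes live in `urllib.request`, which is what the example below uses. The function names `make_proxy_opener` and `fetch` are hypothetical, and the example assumes an unauthenticated proxy that serves both http and https.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Return an opener that sends http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    # build_opener uses our ProxyHandler in place of the default one.
    return urllib.request.build_opener(handler)


def fetch(url, opener):
    """Download url with the given opener and return the raw bytes."""
    with opener.open(url, timeout=10) as page:
        return page.read()
```

For the proxy mentioned in the question you could try `opener = make_proxy_opener("http://proxy4.nehu.ac.in:3128")` and then `text = fetch(url, opener)` in place of the `urllib.urlopen(url)` call. On Python 2 the equivalent names are `urllib2.ProxyHandler` and `urllib2.build_opener`.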

shevron
  • I am new to Python, so it's hard for me to implement. Just to give me a head start, can you show me how I should put it in my program? – Shailang Kharsati Sep 22 '15 at 11:12
  • I have read in the Python documentation that there is a ProxyHandler in urllib2 that can handle a proxy. How do I set it up so that requests go through the proxy I use to connect to the internet? Please help. – Shailang Kharsati Sep 22 '15 at 11:30