replace characters not working in python

Question

I am writing a crawler with Beautiful Soup and have the following code in it:

        print soup.originalEncoding
        #self.addtoindex(page, soup)

        links = soup('a')
        for link in links:
            if 'href' in dict(link.attrs):
                link['href'].replace('..', '')
                url = urljoin(page, link['href'])
                if url.find("'") != -1:
                    continue
                url = url.split('?')[0]
                url = url.split('#')[0]
                if url[0:4] == 'http':
                    newpages.add(url)
    pages = newpages

The link['href'].replace('..', '') call is supposed to fix links that come out as ../contact/orderform.aspx, ../contact/requestconsult.aspx, etc. However, it is not working: the links still have the leading "..". Is there something I am missing?

score 58 · Accepted Answer · edited Feb 22 '13 at 18:47

str.replace() returns a new string with the replacements applied. It doesn't modify the original, so assign the result back, like this:

link['href'] = link['href'].replace("..", "")
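
To make the difference concrete, here is a minimal sketch. It assumes a plain dict as a stand-in for the Beautiful Soup tag (the dict and the variable names `unchanged` and `fixed` are only for illustration, not part of the original crawler):

```python
# A plain dict stands in for the tag here (an assumption for illustration);
# it supports the same link['href'] item access and assignment.
link = {'href': '../contact/orderform.aspx'}

# Calling replace() and discarding the result leaves the stored value as it was.
link['href'].replace('..', '')
unchanged = link['href']

# Assigning the result back stores the new string.
link['href'] = link['href'].replace('..', '')
fixed = link['href']

print(unchanged)  # ../contact/orderform.aspx
print(fixed)      # /contact/orderform.aspx
```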

edited Feb 22 '13 at 18:47

mechanical_meat

answered Aug 26 '11 at 18:15

joel goldstick

jan zegan · Answer 2 · 2011-08-26T18:28:13.507

15

string.replace() returns a copy of the string with characters replaced, as strings in Python are immutable. Try

s = link['href'].replace("..", '')
url=urljoin(page, s)
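
As a rough, self-contained sketch of the two lines above: the `page` URL below is a made-up example, and the import fallback is only there so the snippet runs on both Python 2 and Python 3. For comparison, it also passes the untouched href to urljoin, which resolves ".." segments against the base URL on its own.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# Hypothetical base page, chosen only for illustration.
page = 'http://example.com/products/index.aspx'
href = '../contact/orderform.aspx'

# Keep the result of replace() in a new name, since href itself is unchanged.
s = href.replace('..', '')
url = urljoin(page, s)

# For comparison: urljoin applied to the original href.
direct = urljoin(page, href)

print(url)     # http://example.com/contact/orderform.aspx
print(direct)  # http://example.com/contact/orderform.aspx
```

In this particular case both calls give the same URL, because the stripped href starts with "/" and the base page sits one directory below the site root.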

edited Aug 26 '11 at 18:28

answered Aug 26 '11 at 18:21

jan zegan

score 11 · Answer 3 · answered Aug 26 '11 at 18:17

replace() does not work in place; it returns a new string. You need to do:

link['href'] = link['href'].replace('..', '')

Example:

>>> a = "abc.."
>>> a.replace("..", "")
'abc'
>>> a
'abc..'
>>> a = a.replace("..", "")
>>> a
'abc'

answered Aug 26 '11 at 18:17

Urjit
