Python + BeautifulSoup: How to get ‘href’ attribute of ‘a’ element?

Question

I have the following:

  html =
  '''<div class=“file-one”>
    <a href=“/file-one/additional” class=“file-link">
      <h3 class=“file-name”>File One</h3>
    </a>
    <div class=“location”>
      Down
    </div>
  </div>'''

I would like to get just the value of href, which is /file-one/additional. So I did:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

link_text = “”

for a in soup.find_all(‘a’, href=True, text=True):
    link_text = a[‘href’]

print “Link: “ + link_text

But it prints nothing after the label, just Link:. I tested the same code on another site with different HTML, and it worked.

What could I be doing wrong? Or is it possible that the site is intentionally programmed not to return the href?

Thank you in advance and will be sure to upvote/accept answer!

For that matter, why does your *code* have curly-quotes in it? What are you coding in? You need to use a text editor. — user2357112, May 05 '17 at 22:55
If you remove the parameter `text=True`, your code works for me — chickity china chinese chicken, May 05 '17 at 23:18
If you need more info on quotes, check this article out: https://blogs.msdn.microsoft.com/oldnewthing/20090225-00/?p=19033 — Kyle Falconer, May 05 '17 at 23:24
@downshift What does `text=True` do? Thought it returns in text form — , May 05 '17 at 23:49
@LyManeug, the `text` parameter expects a `string` type; from the [docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) "you can search for strings instead of tags"; the `text` parameter is used in earlier versions and has been changed to `string`; but yeah, I think using a `boolean` with `text` instead of a `string` is what was the problem. — chickity china chinese chicken, May 05 '17 at 23:55
Or maybe you were thinking of passing a `boolean` to the `soup.find_all` function, as in [`for tag in soup.find_all(True):`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#true) — chickity china chinese chicken, May 06 '17 at 00:00

t.m.adam · Answer 1 · 2019-03-28T08:35:11.457

The 'a' tag in your html does not have any text directly, but it contains an 'h3' tag that has text. This means that the tag's .string is None, so the text=True filter does not match it and .find_all() returns nothing. Generally, avoid the text parameter when a tag contains other html elements in addition to text content.
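To see this for yourself, here is a minimal sketch using the markup from the question (retyped with straight quotes). It assumes a Beautiful Soup version that accepts the newer `string` argument, which the comments above mention as the replacement for `text`; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
  <div class="location">Down</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
a = soup.find("a", href=True)

# The anchor has three children (whitespace, <h3>, whitespace),
# so it has no single string of its own.
direct_string = a.string

# get_text() gathers text from all descendants, including the <h3>.
nested_text = a.get_text(strip=True)

# A string filter combined with a tag name is checked against the
# tag's .string, so the anchor is skipped.
matched = soup.find_all("a", href=True, string=True)

print(direct_string, repr(nested_text), matched)
```

The href is there all along; only the string filter keeps the tag out of the result.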

You can resolve this issue if you use only the tag's name (and the href keyword argument) to select elements. Then add a condition in the loop to check if they contain text.

soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True): 
    if a.text: 
        links_with_text.append(a['href'])

Or you could use a list comprehension, if you prefer one-liners.

links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]

Or you could pass a lambda to .find_all().

tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
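One detail to keep in mind with the `a.text` check: an anchor that contains only whitespace still has truthy text. If that matters for your page, a stricter variant could strip the text first. The markup below is made up for illustration, and `loose` and `strict` are just names I picked.

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering the edge cases.
html = """
<a href="/with-text"><h3>File One</h3></a>
<a href="/whitespace-only">   </a>
<a href="/empty"></a>
<a id="no-href">Anchor without href</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Truthiness of a.text: whitespace-only text still counts as text.
loose = [a["href"] for a in soup.find_all("a", href=True) if a.text]

# Stripping first keeps only anchors with visible text.
strict = [
    a["href"]
    for a in soup.find_all("a", href=True)
    if a.get_text(strip=True)
]

print(loose)
print(strict)
```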

If you want to collect all links whether they have text or not, just select all 'a' tags that have an 'href' attribute. Anchor tags usually have links, but that's not a requirement, so I think it's best to use the href argument.

Using .find_all().

links = [a['href'] for a in soup.find_all('a', href=True)]

Using .select() with CSS selectors.

links = [a['href'] for a in soup.select('a[href]')]
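As a quick check that the two approaches agree, and of why the href filter is worth having, here is a small sketch on made-up markup where one anchor has no href. The sample html and variable names are assumptions for the example only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor has no href attribute.
html = """
<a href="/file-one/additional">File One</a>
<a id="bookmark">No href here</a>
<a href="/file-two">File Two</a>
"""

soup = BeautifulSoup(html, "html.parser")

via_find_all = [a["href"] for a in soup.find_all("a", href=True)]
via_select = [a["href"] for a in soup.select("a[href]")]

# Without the filter, a["href"] would raise KeyError on the second anchor;
# .get() returns None for it instead.
unfiltered = [a.get("href") for a in soup.find_all("a")]

print(via_find_all)
print(via_select)
print(unfiltered)
```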

Thought to inform you about a question that I find trouble figuring out myself. I'll be very glad if you give [this post](https://stackoverflow.com/questions/59594692/unable-to-use-https-proxy-within-urllib-request) a go. Thanks. — MITHU, Jan 05 '20 at 09:54

score 6 · Answer 2 · answered Jun 04 '18 at 11:52

You can also use attrs to get the href attribute, selecting the tag with a regex search:

import re
soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
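A slightly more defensive sketch of the same idea: `find()` returns None when nothing matches, so chaining `.attrs` directly can raise AttributeError. The pattern below is my rewrite with a single character class; it should match the same strings as the one above. The sample html and the `mailto:` pattern are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the first href has no slash, so it is skipped.
html = """
<a href="#top">Top</a>
<a href="/file-one/additional">File One</a>
"""

# A slash, one ASCII letter, then at least one word character.
pattern = re.compile(r"/[A-Za-z]\w+")

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("a", href=pattern)

# find() returns None when no tag matches, so check before reading attrs.
href = tag.attrs["href"] if tag is not None else None

# A pattern that matches nothing in this document.
missing = soup.find("a", href=re.compile(r"^mailto:"))

print(href, missing)
```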

Rakshit Vats

Do you know why calling directly `.href` does not work, but `.attrs['href']` works fine? I just spent 15 min to debug this :( – Jean Monet Dec 18 '20 at 22:23

score 5 · Answer 3 · answered May 05 '17 at 23:18

First of all, use a text editor that doesn't insert curly quotes.
Second, remove the text=True argument from the soup.find_all call.
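Putting both suggestions together, the code from the question might look like the sketch below. It assumes Python 3, so the print statement from the question becomes a function call; everything else follows the original.

```python
from bs4 import BeautifulSoup

html = """<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
  <div class="location">
    Down
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

link_text = ""

# No text=True here: the anchor's text lives in the nested <h3>.
for a in soup.find_all("a", href=True):
    link_text = a["href"]

print("Link: " + link_text)  # Python 3 print function
```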

whackamadoodle3000

score 3 · Answer 4 · answered Oct 09 '20 at 22:57

You could solve this with just a couple lines of gazpacho:


from gazpacho import Soup

html = """\
<div class="file-one">
    <a href="/file-one/additional" class="file-link">
      <h3 class="file-name">File One</h3>
    </a>
    <div class="location">
      Down
    </div>
  </div>
"""

soup = Soup(html)
soup.find("a", {"class": "file-link"}).attrs['href']

Which would output:

'/file-one/additional'
