21

I have the following:

  html =
  '''<div class=“file-one”>
    <a href=“/file-one/additional” class=“file-link">
      <h3 class=“file-name”>File One</h3>
    </a>
    <div class=“location”>
      Down
    </div>
  </div>'''

I would like to get just the value of the href attribute, which is /file-one/additional. So I did:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

link_text = “”

for a in soup.find_all(‘a’, href=True, text=True):
    link_text = a[‘href’]

print “Link: “ + link_text

But it prints nothing after the label, just Link:. I then tested the same code on another site with different HTML, and it worked.

What could I be doing wrong? Or is it possible that the site was intentionally programmed not to return the href?

Thank you in advance; I will be sure to upvote/accept an answer!

4 Answers

41

The 'a' tag in your HTML does not contain text directly; it contains an 'h3' tag that has the text, surrounded by whitespace. Because the tag has more than one child, its .string is None, so .find_all() with text=True does not select it. In general, avoid the text parameter when a tag contains other HTML elements in addition to text content.
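A minimal sketch of that behaviour, using a trimmed copy of the question's HTML with straight quotes. It assumes a recent bs4 release, where `string` is the current name of the `text` argument; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a', href=True)

# The anchor has three children (whitespace, <h3>, whitespace),
# so .string cannot pick a single string and is None.
direct_string = a.string

# get_text() gathers text from all descendants instead.
nested_text = a.get_text(strip=True)

# Filtering on string=True therefore skips this anchor.
matches = soup.find_all('a', href=True, string=True)

print(direct_string, repr(nested_text), matches)
```

The same anchor is found as soon as the string filter is dropped, which is what the solutions below rely on.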

You can resolve this issue if you use only the tag's name (and the href keyword argument) to select elements. Then add a condition in the loop to check if they contain text.

soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
# Select every anchor that has an href, then keep those with text content.
for a in soup.find_all('a', href=True):
    if a.text:
        links_with_text.append(a['href'])

Or you could use a list comprehension, if you prefer one-liners.

links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]

Or you could pass a lambda to .find_all().

tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
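One caveat with the `a.text` checks above: `a.text` is also truthy for an anchor that contains only whitespace. A small sketch that ignores such anchors with `get_text(strip=True)`; the helper name `hrefs_with_visible_text` and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def hrefs_with_visible_text(html):
    """Return the href of every <a> whose text is not just whitespace."""
    soup = BeautifulSoup(html, 'html.parser')
    return [
        a['href']
        for a in soup.find_all('a', href=True)
        if a.get_text(strip=True)
    ]


# Hypothetical markup: nested text, whitespace-only text, and no href.
sample = '''
<a href="/file-one/additional"><h3>File One</h3></a>
<a href="/blank">   </a>
<a name="no-href">Anchor without href</a>
'''

print(hrefs_with_visible_text(sample))
```

Only the first anchor is kept: the second has no visible text and the third has no href.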

If you want to collect all links whether they have text or not, just select all 'a' tags that have an 'href' attribute. Anchor tags usually have an href, but that's not a requirement, so I think it's best to use the href argument.

Using .find_all().

links = [a['href'] for a in soup.find_all('a', href=True)]

Using .select() with CSS selectors.

links = [a['href'] for a in soup.select('a[href]')]
t.m.adam
6

You can also use attrs to get the href attribute, matching it with a regex search:

import re

soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
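For reference, a short sketch of the ways to read the attribute once the tag is found. Note that attribute-style access such as `tag.href` is Beautiful Soup shorthand for `tag.find('href')`, i.e. it looks for a child tag named href, so it returns None here. The sample markup is illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/file-one/additional">File One</a>', 'html.parser')
tag = soup.find('a', href=True)

via_attrs = tag.attrs['href']   # the attribute dictionary
via_index = tag['href']         # shorthand for the same lookup
via_get = tag.get('href')       # returns None instead of raising if missing
via_dot = tag.href              # searches for a child <href> tag, not the attribute

print(via_attrs, via_index, via_get, via_dot)
```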
Rakshit Vats
  • Do you know why calling directly `.href` does not work, but `.attrs['href']` works fine? I just spent 15 min to debug this :( – Jean Monet Dec 18 '20 at 22:23
5
  1. First of all, use a text editor that doesn't insert curly quotes; Python only accepts straight quotes (' and ") as string delimiters.

  2. Second, remove the text=True argument from the soup.find_all() call.
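Applied to the question's code, those two changes might look like the sketch below. It also uses the Python 3 `print()` function, which is an assumption about the interpreter; the original used the Python 2 print statement.

```python
from bs4 import BeautifulSoup

html = '''<div class="file-one">
    <a href="/file-one/additional" class="file-link">
      <h3 class="file-name">File One</h3>
    </a>
    <div class="location">
      Down
    </div>
  </div>'''

soup = BeautifulSoup(html, 'html.parser')

link_text = ""

# No text=True here: the anchor's text lives inside the nested <h3>.
for a in soup.find_all('a', href=True):
    link_text = a['href']

print("Link: " + link_text)
```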

whackamadoodle3000
3

You could solve this with just a couple lines of gazpacho:


from gazpacho import Soup

html = """\
<div class="file-one">
    <a href="/file-one/additional" class="file-link">
      <h3 class="file-name">File One</h3>
    </a>
    <div class="location">
      Down
    </div>
  </div>
"""

soup = Soup(html)
soup.find("a", {"class": "file-link"}).attrs['href']

Which would output:

'/file-one/additional'
emehex