How to get only "href" with BeautifulSoup4?

Question

I am trying to get only the link from the result of `find_all()`.

Here is my code:

    mydivs = soup.find_all("td", {"class": "candidates"})
    for link in mydivs:
        print(link)

But it returns:

<td class="candidates"><div><a data-tn-element="view-unread-candidates" data-tn-link="true" href="/c#candidates?id=a722443b402&amp;ctx=jobs-tab-view-candidates">56 candidates</a><br/><a data-tn-element="view-unread-candidates" data-tn-link="true" href="/c#candidates?id=a7b2a139b402&amp;candidateFilter=4af15d8991a8"><span class="jobs-u-font--bold">(45 awaiting review)</span></a></div></td>

What I want to get:

/c#candidates?id=a722443b402&ctx=jobs-tab-view-candidates

do you want to include the href or no? And is this converted to a string already or no? Not really an MCVE to be honest. — Edeki Okoh, May 10 '19 at 18:12
hey ! I just want to get that `/c#candidates?id=a7b2a139b402&candidateFilter=4af15d8991a8` — Solal, May 10 '19 at 18:13
@daka I am going through the post you sent. I am trying `link.href` but it returns `None`. My value `link` is a `<td>` and it contains an href. Can you please advise? — Solal, May 10 '19 at 18:30
You need to **find** the `a` element inside before you try to access the `href` attribute. — daka, May 10 '19 at 18:36
@EdekiOkoh I did try your solution but `link.text` returns only the text associated to the tag. It didn't work — Solal, May 10 '19 at 18:51
Try now. Rather than returning the text we convert the entire BS4 link into a string. — Edeki Okoh, May 10 '19 at 18:52
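
Following daka's comment, here is a minimal sketch of looking up the `a` elements inside each `td` and reading `href` with dictionary-style access. `link.href` returns `None` because BeautifulSoup treats it as a search for a child tag named `href`, not as an attribute lookup. The `html_doc` string is a trimmed copy of the markup from the question, wrapped in a hypothetical `table`/`tr` so it parses on its own, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, used as sample input.
# The surrounding <table><tr> wrapper is an assumption for illustration.
html_doc = (
    '<table><tr><td class="candidates"><div>'
    '<a href="/c#candidates?id=a722443b402&amp;ctx=jobs-tab-view-candidates">56 candidates</a><br/>'
    '<a href="/c#candidates?id=a7b2a139b402&amp;candidateFilter=4af15d8991a8">(45 awaiting review)</a>'
    '</div></td></tr></table>'
)

soup = BeautifulSoup(html_doc, "html.parser")

hrefs = []
for td in soup.find_all("td", {"class": "candidates"}):
    # href=True skips any <a> that has no href attribute.
    for a in td.find_all("a", href=True):
        hrefs.append(a["href"])

for href in hrefs:
    print(href)
```

With this input it should print both links with `&amp;` already decoded to `&`; take `hrefs[0]` if only the first link is wanted.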

Edeki Okoh · Answer 1 · 2019-05-10T18:52:29.273

You can convert the bs4 element into a string and then use a regex to capture everything between `href="` and the closing quotation mark.

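    import re

    # Rest of imports/code up until your script.

    mydivs = soup.find_all("td", {"class": "candidates"})
    for link in mydivs:
        link_text = str(link)
        # \s* allows optional spaces around "="; str(link) normally gives href="..."
        href_link = re.search(r'href\s*=\s*"(.+?)"', link_text)
        if href_link:  # re.search returns None when nothing matches
            print(href_link.group(1))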

Small Example Shown Below:

    import re

    link_text = '<td class = "candidates" > <div > <a data-tn-element = "view-unread-candidates" data-tn-link = "true" href = "/c#candidates?id=a722443b402&amp;ctx=jobs-tab-view-candidates" > 56 candidates < /a > <br/> < a data-tn-element = "view-unread-candidates" data-tn-link = "true" href = "/c#candidates?id=a7b2a139b402&amp;candidateFilter=4af15d8991a8" > <span class = "jobs-u-font--bold" > (45 awaiting review) < /span > </a > </div > </td >'
    href_link = re.search('href = "(.+?)"', link_text)
    print(href_link.group(1))

Output:

/c#candidates?id=a722443b402&amp;ctx=jobs-tab-view-candidates

The pattern has to match the spacing around `href =` as it actually appears in your string, and I cannot see what your tag looks like, so you may need to adjust the spaces inside the `re.search` pattern. Copy the text from `href` up to the first character of the link you want and it should work.
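
Note that the regex returns the raw attribute text, so `&amp;` stays escaped, while the question asks for a plain `&`. A minimal sketch of decoding it, assuming the standard-library `html.unescape` is acceptable here; the shortened `link_text` is a hypothetical sample based on the markup in the question.

```python
import html
import re

# Hypothetical shortened sample based on the markup in the question.
link_text = (
    '<a data-tn-link="true" '
    'href="/c#candidates?id=a722443b402&amp;ctx=jobs-tab-view-candidates">'
    '56 candidates</a>'
)

# \s* tolerates optional spaces around "=", so href="..." and href = "..." both match.
match = re.search(r'href\s*=\s*"(.+?)"', link_text)

# html.unescape turns entities such as &amp; back into plain characters.
href = html.unescape(match.group(1)) if match else None
print(href)
```

This should print `/c#candidates?id=a722443b402&ctx=jobs-tab-view-candidates`, which is the form shown under "What I want to get".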

No, because it's unnecessarily complicated, which makes it a bad answer, and worthy of a down vote. — daka, May 10 '19 at 18:23
Seeing how the user tried the post you marked as duplicate and it returned None, I would not call this overly complicated but rather a solution that works. — Edeki Okoh, May 10 '19 at 18:32
