
I am learning to scrape websites. I need to collect document titles and the links to those documents. I can already do this, but the resulting links are sometimes not in the format I need. Here is a sample of the output I get:

['Плотность населения субъектов Российской Федерации', 'http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm']
Плотность населения субъектов Российской Федерации
http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm


['Численность мужчин и женщин', '/storage/mediabank/yKsfiyjR/demo13.xls']
Численность мужчин и женщин
/storage/mediabank/yKsfiyjR/demo13.xls

As you can see, in the second case I get only the path part of the link, whereas in the first case I get the full URL. For links like the second one, I need to prepend a prefix that I know in advance, but only when the link is in that partial format. In other words, I want the output to look like this:

['Плотность населения субъектов Российской Федерации', 'http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm']
Плотность населения субъектов Российской Федерации
http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm


['Численность мужчин и женщин', 'https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls']
Численность мужчин и женщин
https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls

How should I do this? Here is my current code:

import requests
from bs4 import BeautifulSoup

URL = "https://rosstat.gov.ru/folder/12781"

response = requests.get(URL).text
soup = BeautifulSoup(response, 'lxml')
block = soup.find('div', class_="col-lg-8 order-1 order-lg-1")

list_info_block_row = block.find_all('div', class_='document-list__item document-list__item--row')
list_info_block_col = block.find_all('div', class_='document-list__item document-list__item--col')

sources = []

for text_block_row in list_info_block_row:
    new_list = []
    title_element_row = text_block_row.find('div', class_='document-list__item-title')
    preprocessing_title = title_element_row.text.strip()
    link_element_row = text_block_row.find('a').get('href')
    new_list.append(preprocessing_title)
    new_list.append(link_element_row)
    print(new_list)
    print(title_element_row.text.strip())
    print(link_element_row)
    print('\n\n')

martineau
kostya ivanov

Comment: Please take the tour, read [what's on-topic here](/help/on-topic), How to Ask, and the [question checklist](//meta.stackoverflow.com/q/260648/843953), and provide a minimal reproducible example. "Implement this feature for me" is off-topic for this site because SO isn't a free online coding service. You have to _make an honest attempt_, and then ask a _specific question_ about your algorithm or technique. In this case, have you identified the pattern you need to match to tell whether you need to prepend the prefix or not? Where is your code to match this pattern? – Pranav Hosangadi Jan 31 '22 at 15:34

1 Answer


You can check whether the link already has a scheme and, if it does not, prepend the scheme and host taken from the page URL:

if not link_element_row.startswith("http"):
    parsed_url = urlparse(URL)
    link_element_row = (
        parsed_url.scheme + "://" + parsed_url.netloc + link_element_row
    )
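
The same check can be tried without any network access by wrapping it in a small function. This is only a sketch: the name `make_absolute` and its `page_url` parameter are hypothetical, and the two sample links are the ones shown in the question.

```python
from urllib.parse import urlparse

URL = "https://rosstat.gov.ru/folder/12781"


def make_absolute(href, page_url=URL):
    """Hypothetical helper: add the page's scheme and host to a host-relative link."""
    # Links that already start with "http" are treated as complete.
    if href.startswith("http"):
        return href
    parsed = urlparse(page_url)
    # Assumes href starts with "/", as in the question's second example.
    return parsed.scheme + "://" + parsed.netloc + href


full = make_absolute("/storage/mediabank/yKsfiyjR/demo13.xls")
unchanged = make_absolute(
    "http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm"
)
print(full)
print(unchanged)
```

Note that the `startswith("http")` test assumes every partial link begins with a slash; a link such as `docs/file.xls` would be glued directly onto the host name.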

Full working code:

import requests
from urllib.parse import urlparse
from bs4 import BeautifulSoup

URL = "https://rosstat.gov.ru/folder/12781"

response = requests.get(URL).text
soup = BeautifulSoup(response, "lxml")
block = soup.find("div", class_="col-lg-8 order-1 order-lg-1")

list_info_block_row = block.find_all(
    "div", class_="document-list__item document-list__item--row"
)
list_info_block_col = block.find_all(
    "div", class_="document-list__item document-list__item--col"
)

for text_block_row in list_info_block_row:
    new_list = []
    title_element_row = text_block_row.find("div", class_="document-list__item-title")
    preprocessing_title = title_element_row.text.strip()
    link_element_row = text_block_row.find("a").get("href")
    new_list.append(preprocessing_title)

    if not link_element_row.startswith("http"):
        parsed_url = urlparse(URL)
        link_element_row = (
            parsed_url.scheme + "://" + parsed_url.netloc + link_element_row
        )

    new_list.append(link_element_row)

    print(new_list)
    print(title_element_row.text.strip())
    print(link_element_row)
    print("\n\n")
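
As an alternative to building the string by hand, the standard library's `urllib.parse.urljoin` resolves a link against a base URL and leaves links that are already absolute unchanged. This is a different technique from the one above, shown as a sketch; `BASE_URL`, `hrefs`, and `resolved` are illustrative names, and the sample links are taken from the question.

```python
from urllib.parse import urljoin

# Page the links were scraped from; relative links are resolved against it.
BASE_URL = "https://rosstat.gov.ru/folder/12781"

hrefs = [
    "http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm",
    "/storage/mediabank/yKsfiyjR/demo13.xls",
]

# urljoin keeps absolute URLs as they are and completes partial ones.
resolved = [urljoin(BASE_URL, href) for href in hrefs]

for link in resolved:
    print(link)
```

In the loop above, this would amount to replacing the `if` block with `link_element_row = urljoin(URL, link_element_row)`.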


Eliaz Bobadilla