Getting some data from WordPress forums requires two parts: login and parsing. Both work very well as standalone parts. I can log in with Selenium perfectly, and I can parse (scrape) the data with BS4. But when I combine the two parts, I run into session issues that I cannot solve.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
 
#--| Setup
options = Options()
#options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
#options.add_argument('--disable-gpu')
browser = webdriver.Chrome(executable_path=r'C:\chrome\chromedriver.exe', options=options)
#--| Parse or automation
browser.get("https://login.wordpress.org/?locale=en_US")
time.sleep(2)
user_name = browser.find_element_by_css_selector('#user_login')
user_name.send_keys("the username")
password = browser.find_element_by_css_selector('#user_pass')
password.send_keys("the pass")
time.sleep(5)
submit = browser.find_elements_by_css_selector('#wp-submit')[0]
submit.click()
 
# Example send page source to BeautifulSoup or selenium for parse
soup = BeautifulSoup(browser.page_source, 'lxml')
use_bs4 = soup.find('title')
print(use_bs4.text)
#print('*' * 25)
#use_sel = browser.find_elements_by_css_selector('div > div._1vC4OE')
#print(use_sel[0].text)

Note: this works perfectly. You can check it with the following combination:

login: pluginfan
pass: testpasswd123

See below the parser & scraper with BS4, which works very well on its own:

#!/usr/bin/env python3
 
import requests
from bs4 import BeautifulSoup as BS
 
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # this page needs the 'User-Agent' header
 
url = 'https://wordpress.org/support/plugin/advanced-gutenberg/page/{}/'
 
for page in range(1, 3):
    print('\n--- PAGE:', page, '---\n')
 
    # read page with list of posts
    r = session.get(url.format(page))
 
    soup = BS(r.text, 'html.parser')
 
    all_uls = soup.find('li', class_="bbp-body").find_all('ul')
 
    for number, ul in enumerate(all_uls, 1):
 
        print('\n--- post:', number, '---\n')
 
        a = ul.find('a')
        if a:
            post_url = a['href']
            post_title = a.text
 
            print('href:', post_url)
            print('text:', post_title)
            print('---------')
 
            # read page with post content
            r = session.get(post_url)
 
            sub_soup = BS(r.text, 'html.parser')
 
            post_content = sub_soup.find(class_='bbp-topic-content').get_text(strip=True, separator='\n')
            print(post_content)

But the combination of both does not work. I guess that I cannot create a new session with requests; I must work with the session that Selenium created. So I have some issues running the parser together with the login part.

The standalone parser gives back valid content - that's fine!

--- post: 1 ---
 
href: https://wordpress.org/support/topic/advanced-button-with-icon/
text: Advanced Button with Icon?
---------
is it not possible to create a button with a font awesome icon to left / right?
 
--- post: 2 ---
 
href: https://wordpress.org/support/topic/expand-collapse-block/
text: Expand / Collapse block?
---------
At the very bottom I have an expandable requirements.
Do you have a better block? I would like to use one of yours if poss.
The page I need help with:
 
--- post: 3 ---
 
href: https://wordpress.org/support/topic/login-form-not-formatting-correctly/
text: Login Form Not Formatting Correctly
---------
Getting some weird formatting with the email & password fields running on outside the form.
Tried on two different sites.
Thanks
 
..... [,,,,,] ....
 
--- post: 22 ---
 
href: https://wordpress.org/support/topic/settings-import-export-2/
text: Settings Import & Export
---------
Traceback (most recent call last):
  File "C:\Users\Kasper\Documents\_f_s_j\_mk_\_dev_\bs\____wp_forum_parser_without_login.py", line 43, in <module>
    print(post_content)
  File "C:\Program Files\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f642' in position 95: character maps to <undefined>
[Finished in 14.129s]
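
Side note on the traceback: the UnicodeEncodeError is not a session problem; it happens because the cp1252 Windows console cannot print the '\U0001f642' emoji found in a post. A minimal workaround sketch, assuming Python 3.7+ and independent of the login issue:

import sys

# force UTF-8 (with replacement for unprintable characters) on stdout,
# so emoji in forum posts don't crash print() on a cp1252 Windows console
sys.stdout.reconfigure(encoding='utf-8', errors='replace')

print('\U0001f642')  # no longer raises UnicodeEncodeError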

Any ideas?

  • You should be able to log in using requests. Not sure if that would fix your problem, but it might make your life simpler. From the error, it looks like an encoding problem. Have you tried different encodings when you get the text from the soup? – manny Jun 26 '20 at 13:15
  • Hello - many thanks for the hint - I am very happy. Two things come to my mind. I have had encoding issues here in earlier times, so that can be a great hint. But regarding the session: if I also use requests, does this not produce conflicts with the Selenium session!? I always thought that this produces conflicts... btw - I have added login credentials for accessing the WordPress site... update: the encoding thing: I guess you're right, see here https://stackoverflow.com/questions/60698567/use-beatiful-soup-in-scraping-multiple-websites/60698677#comment107642457_60698677 – zero Jun 26 '20 at 13:32
  • Dear Manny - many thanks for the reply and for all your great and supportive hints. I will take a closer look at the weekend. Now I have to leave the house, but I will come back later in the day. Again many, many thanks - you have made my day!!! – zero Jun 26 '20 at 13:36
  • @manny - sure thing, your idea regarding requests is pretty interesting, but I guess that it might not help here since I need Selenium for the captcha-based login process at WordPress. So I guess that I would finally run into conflicts between Selenium and requests. What do you think!? – zero Jun 29 '20 at 12:08
  • You could potentially log in with Selenium, then, once that's complete, pass the session cookie to your parser, add it to your session, and then parse that way. – manny Jun 30 '20 at 12:46
  • Dear manny - great, that sounds great. I will have a closer look at this option!! btw: the parser code above gathers conversations on WP forums, which I would like to save in a CSV file. There are smart ways to have the "results" that contain `author: text: url: - if one is given in the thread.. etc.` We can do this with the requests library or urllib; with requests we can see how to do the CSV writing, which is what I am interested in: saving in columns (`from A to D for example`) so that the values are stored in columns from A to D (or so) in the CSV. Ideas? – zero Jul 02 '20 at 09:18
  • Not sure. I'd check out pandas. – manny Jul 02 '20 at 12:54
  • Hi there - well, pandas would be just great. I do not have much experience with pandas, but surely this would be it!!! Many thanks for the hint - you're just great! ;) – zero Jul 02 '20 at 15:30
  • if you use `Selenium` to login then use `Selenium` also to get other pages and forget `requests` – furas Jul 02 '20 at 23:56
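
Following up on manny's pandas suggestion above: a minimal sketch of writing the collected rows with pandas instead of the csv module. It assumes a data list of dicts with the same keys the answer below builds; the sample rows here are only placeholders:

import pandas as pd

# placeholder rows; in practice this is the list of dicts built while scraping
data = [
    {'href': 'https://example.com/topic-1/', 'text': 'Topic 1', 'content': 'Body 1'},
    {'href': 'https://example.com/topic-2/', 'text': 'Topic 2', 'content': 'Body 2'},
]

df = pd.DataFrame(data, columns=['text', 'href', 'content'])  # fixes column order
df.to_csv('wp-forum-conversations.csv', index=False, encoding='utf-8')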

1 Answer


EDIT: In both versions I added saving to a CSV file.


If you have Selenium and requests then there are three possibilities:

  • use Selenium to login and to get pages.
  • use requests.Session to login and to get pages.
  • use Selenium to login, get the session information from Selenium and use it in requests (see the sketch at the end of this answer)

Using Selenium to login and to get pages is much simpler, but it works slower than requests.

It only needs two changes:

  • browser.get(url) instead of r = session.get(post_url)
  • BeautifulSoup(browser.page_source, ...) instead of BeautifulSoup(r.text, ...)

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
import csv

#--| Setup
options = Options()
#options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
#options.add_argument('--disable-gpu')
browser = webdriver.Chrome(executable_path=r'C:\chrome\chromedriver.exe', options=options)
#browser = webdriver.Firefox()

# --- login ---

browser.get("https://login.wordpress.org/?locale=en_US")
time.sleep(2)

user_name = browser.find_element_by_css_selector('#user_login')
user_name.send_keys("my_login")
password = browser.find_element_by_css_selector('#user_pass')
password.send_keys("my_password")
#time.sleep(5)
submit = browser.find_elements_by_css_selector('#wp-submit')[0]
submit.click()
 
# Example send page source to BeautifulSoup or selenium for parse
soup = BeautifulSoup(browser.page_source, 'lxml')
use_bs4 = soup.find('title')
print(use_bs4.text)
#print('*' * 25)
#use_sel = browser.find_elements_by_css_selector('div > div._1vC4OE')
#print(use_sel[0].text)

# --- pages ---

data = []

url = 'https://wordpress.org/support/plugin/advanced-gutenberg/page/{}/'
 
for page in range(1, 3):
    print('\n--- PAGE:', page, '---\n')
 
    # read page with list of posts
    browser.get(url.format(page))
    soup = BeautifulSoup(browser.page_source, 'html.parser') # 'lxml'
 
    all_uls = soup.find('li', class_="bbp-body").find_all('ul')
 
    for number, ul in enumerate(all_uls, 1):
 
        print('\n--- post:', number, '---\n')
 
        a = ul.find('a')
        if a:
            post_url = a['href']
            post_title = a.text
 
            print('href:', post_url)
            print('text:', post_title)
            print('---------')
 
            # read page with post content
            browser.get(post_url)
            sub_soup = BeautifulSoup(browser.page_source, 'html.parser')
 
            post_content = sub_soup.find(class_='bbp-topic-content').get_text(strip=True, separator='\n')
            print(post_content)

            # keep on list as dictionary
            data.append({
                'href': post_url,
                'text': post_title,
                'content': post_content,
            })
            
# --- save ---

with open("wp-forum-conversations.csv", "w") as f:
    writer = csv.DictWriter(f, ["text", "href", "content"])
    writer.writeheader()
    writer.writerows(data)  # all rows at once

EDIT:

requests works much faster, but it needs more work with DevTools in Firefox/Chrome to see all the fields in the form and what other values it sends to the server. You also need to see where it redirects when the login is correct. BTW: don't forget to turn off JavaScript before using DevTools, because requests doesn't run JavaScript and the page may send different values in the form (and it really does send different fields).

It needs a full User-Agent string to work correctly.

First I load the login page and copy all the values from the <input> fields, to send them together with the login and password.

After logging in I check if it was redirected to a different page, to confirm that the login was correct. You can also check if the page displays your name.

import requests
from bs4 import BeautifulSoup
import csv

s = requests.Session()
s.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0' # it needs full user-agent
})

# --- get page with login form ---

r = s.get("https://login.wordpress.org/?locale=en_US")
soup = BeautifulSoup(r.text, 'html.parser')

# get all fields in form

payload = {}

for field in soup.find_all('input'):
    name = field.get('name')
    if not name:  # skip inputs without a name attribute
        continue
    value = field.get('value', '')
    payload[name] = value
    print(name, '=', value)

# --- login ---

payload['log'] = 'my_login'
payload['pwd'] = 'my_password'

r = s.post('https://login.wordpress.org/wp-login.php', data=payload)
print('redirected to:', r.url)

# --- check if logged in ---

# check if logged in - check if redirected to different page
if r.url.startswith('https://login.wordpress.org/wp-login.php'):
    print('Problem to login')
    exit()

# check if logged in - check displayed name
url = 'https://wordpress.org/support/plugin/advanced-gutenberg/page/1/'
r = s.get(url)

soup = BeautifulSoup(r.text, 'html.parser')
name = soup.find('span', {'class': 'display-name'})
if not name:
    print('Problem to login')
    exit()
else:    
    print('name:', name.text)
    
# --- pages ---

data = []

url = 'https://wordpress.org/support/plugin/advanced-gutenberg/page/{}/'
 
for page in range(1, 3):
    print('\n--- PAGE:', page, '---\n')
 
    # read page with list of posts
    r = s.get(url.format(page))
    soup = BeautifulSoup(r.text, 'html.parser') # 'lxml'
 
    all_uls = soup.find('li', class_="bbp-body").find_all('ul')
 
    for number, ul in enumerate(all_uls, 1):
 
        print('\n--- post:', number, '---\n')
 
        a = ul.find('a')
        if a:
            post_url = a['href']
            post_title = a.text
 
            print('href:', post_url)
            print('text:', post_title)
            print('---------')
 
            # read page with post content
            r = s.get(post_url)
            sub_soup = BeautifulSoup(r.text, 'html.parser')
 
            post_content = sub_soup.find(class_='bbp-topic-content').get_text(strip=True, separator='\n')
            print(post_content)

            # keep on list as dictionary
            data.append({
                'href': post_url,
                'text': post_title,
                'content': post_content,
            })
            
# --- save ---

with open("wp-forum-conversations.csv", "w") as f:
    writer = csv.DictWriter(f, ["text", "href", "content"])
    writer.writeheader()
    writer.writerows(data)  # all rows at once
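
For completeness, the third option from the list above (login with Selenium, then reuse its session in requests) could look roughly like this sketch. It is untested here; note that browser.get_cookies() only returns cookies for the domain the browser is currently on, so you may need to visit the forum domain first to pick up its cookies:

from selenium import webdriver
import requests

browser = webdriver.Chrome(executable_path=r'C:\chrome\chromedriver.exe')

# ... log in with Selenium exactly as in the first version ...

# copy Selenium's cookies into a requests.Session
s = requests.Session()
s.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'
})

for cookie in browser.get_cookies():
    # Selenium returns dicts like {'name': ..., 'value': ..., 'domain': ...}
    s.cookies.set(cookie['name'], cookie['value'], domain=cookie['domain'])

browser.quit()

# from here on, use the requests session exactly as in the second version
r = s.get('https://wordpress.org/support/plugin/advanced-gutenberg/page/1/')
print(r.status_code)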
  • I added code which uses `requests` to login and to get pages. – furas Jul 03 '20 at 00:58
  • There are heroes in this world. Thank you so much! btw: if I want to gather even more data from the conversation, like the author etc., would this be easy to extend? And to output this as CSV, could I use the csv module or pandas!? - see csv: `print(" post_content ") page_soup = get_soup_from_url(",,,,,") with open("wp-forum-conversations.csv", "w") as f: writer = csv.DictWriter(f, ["text", "href", "... and so on and so forth"]) writer.writeheader() for entry in post_content (page_soup): print( post_content ) writer.writerow(entry)` – zero Jul 03 '20 at 09:31
  • In both versions I added saving to a CSV file. – furas Jul 03 '20 at 13:26
  • Many thanks - this is just awesome and overwhelming!! Many thanks again! – zero Jul 03 '20 at 13:32