Issue
I am trying to scrape some rugby statistics from pages that all look like this one (one per player): https://www.unitedrugby.com/clubs/benetton/filippo-alongi
This page is just one example.
First I set up a driver with selenium and then pass the content to BeautifulSoup for html exploration.
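from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
url = "https://www.unitedrugby.com/clubs/benetton/filippo-alongi"
driver = webdriver.Chrome(options=chrome_options)
driver.get(url)
html = driver.page_source  # rendered HTML as a string
soup = BeautifulSoup(html, 'html.parser')
driver.quit()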
At this point, I want to fetch the element with the following class: player-hero__info-wrap. I do that with find_all(), which can find most things but not all of them.
If you open the link I provided and inspect the weight value (118KG), you will land very near this tag in the inspector, so you can see that it exists.
However, it is missing from what I scrape. I am using Selenium because this page seems to need JavaScript rendering before it can be read, but I still can't see all the classes.
I tried adding the following lines to execute javascript:
driver.execute_script("return document.documentElement.outerHTML;")
or even:
driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
But neither returned the missing element.
Can anybody help me fetch this class?
Solution
This is one way to obtain that info, with selenium only (why parse the page twice?):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-notifications")
chrome_options.add_argument("--window-size=1280,720")
webdriver_service = Service("chromedriver/chromedriver")  # path to where you saved the chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(browser, 20)
url = 'https://www.unitedrugby.com/clubs/benetton/filippo-alongi'
browser.get(url)
try:
    wait.until(EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))).click()
    print('accepted cookies')
except Exception:
    print('no cookie button!')
player_stats = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'div[class="player-hero__info-wrap"]')))
print(player_stats.text)
### do other stuff, get other info, etc etc ###
browser.quit()
This dismisses the cookie popup (probably unnecessary in your scenario, but useful if you later interact with the page) and prints the following in the terminal:
accepted cookies
AGE
22
HEIGHT
6'0''
WEIGHT
118KG
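If the stats are needed as data rather than printed text, the alternating label/value lines shown above can be paired into a dictionary. This is only a minimal sketch: parse_player_stats is a hypothetical helper name, and it assumes the text always alternates label, value, label, value as in the output above.

```python
def parse_player_stats(text):
    """Pair alternating label/value lines into a dict.

    Assumes the text is laid out as label, value, label, value, ...
    """
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Even-indexed lines are labels, odd-indexed lines are their values.
    return {label.lower(): value for label, value in zip(lines[0::2], lines[1::2])}


# Sample text copied from the printed output above.
sample = "AGE\n22\nHEIGHT\n6'0''\nWEIGHT\n118KG"
stats = parse_player_stats(sample)
print(stats)
```

With the Selenium code above, this would be called as parse_player_stats(player_stats.text) before browser.quit().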
You don't really need BeautifulSoup when using Selenium, since Selenium has its own powerful locators and element-finding methods. For documentation, please visit https://www.selenium.dev/documentation/
EDIT: And here is another solution based on requests/BeautifulSoup:
import requests
from bs4 import BeautifulSoup as bs
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.unitedrugby.com/clubs/benetton/filippo-alongi'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
player_data = soup.select_one('div.player-hero__info-wrap')
print(player_data.text.strip())
Result:
Age
22
Height
6'0''
Weight
118KG
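Note that select_one returns None when nothing matches, so player_data.text would raise an AttributeError on a player page that lacks this block. Below is a guarded sketch, run against a made-up HTML fragment so it works offline. Only the class name player-hero__info-wrap comes from the real page; the inner tags and the helper name extract_stats are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: only the outer class name is taken from the real page.
html = """
<div class="player-hero__info-wrap">
  <div><span>Age</span><span>22</span></div>
  <div><span>Height</span><span>6'0''</span></div>
  <div><span>Weight</span><span>118KG</span></div>
</div>
"""


def extract_stats(markup):
    """Return {label: value} from the info block, or {} if the block is absent."""
    soup = BeautifulSoup(markup, "html.parser")
    wrap = soup.select_one("div.player-hero__info-wrap")
    if wrap is None:
        # select_one found nothing; avoid calling .text on None.
        return {}
    # stripped_strings yields each non-empty text node with whitespace trimmed.
    strings = list(wrap.stripped_strings)
    return dict(zip(strings[0::2], strings[1::2]))


print(extract_stats(html))
print(extract_stats("<p>no stats here</p>"))
```

When looping over many player URLs, an empty dict makes it easy to skip or log pages where the block is missing instead of crashing halfway through.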
Relevant documentation: BeautifulSoup at https://beautiful-soup-4.readthedocs.io/en/latest/index.html and requests at https://requests.readthedocs.io/en/latest/
Answered By - Barry the Platipus