Issue
What I'd like to do is scrape a Clash of Clans player's profile page on clashofstats.com, for instance https://www.clashofstats.com/players/captain-morgan-L9YJUPY22/history/log, to get an approximation of the first and last time played based on the logged clan activity. I have already implemented the last-played part with the following:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://www.clashofstats.com/players/captain-morgan-L9YJUPY22/history/log')
soup = BeautifulSoup(response.content, 'html.parser')
end_dates = soup.find_all(class_="end date")
# get_text() returns the tag's text directly, so no manual tag stripping is needed
last_played = end_dates[0].get_text(strip=True).replace(",", "")
print(f"Last time played| {last_played}")
Output:
Last time played| Sep 30 2022
(The odd format is there to match the rest of my code.)
Now to my question: the problem is the first time played. clashofstats splits the logged clans over multiple pages, but when I go to the last page (where the first date is), neither the URL nor the page source changes. I can only see the changes in the dev tools. How can I navigate to that last page, preferably using BeautifulSoup, and get the date, if that is even possible?
Solution
Since the 'next' and page buttons are not hyperlinks and the site doesn't seem to load the data via any easy-to-find API, I expect it would take a horribly convoluted process of requesting and parsing scripts to retrieve this one date from the last page.
Instead, you could use selenium to click the last page button [(2) in this case] and then get the last date. (If you haven't used selenium before, I found this to be a very helpful starting point.)
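One thing to watch for with this approach: the log appears to be redrawn by JavaScript after the click, so reading the dates immediately afterwards may still return the first page's rows. A minimal sketch of a polling wait in plain Python follows; `wait_until` and its parameters are hypothetical names for illustration, not part of Selenium, which ships its own `WebDriverWait` for the same purpose.

```python
import time


def wait_until(predicate, timeout=10.0, poll=0.25):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success. (Hypothetical helper, not a Selenium API.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

With a driver, the predicate could compare the text of the first `span.start.date` element before and after the click, for example `wait_until(lambda: current_first_date() != old_first_date)`, where both names are placeholders. In the code below, such a wait would sit between the click and the second `find_elements` call.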
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

chromeDriver_path = 'chromedriver.exe'
# I just copied the exe file to the same folder as this py file
# (Selenium 4 expects the driver path to be passed through a Service object)
driver = webdriver.Chrome(service=Service(chromeDriver_path))

tag = "L9YJUPY22"  # your player tag without '#'
driver.get('https://www.clashofstats.com/players/' + tag + '/history/log')

start_dates = driver.find_elements(By.CSS_SELECTOR, 'span.start.date')
end_dates = driver.find_elements(By.CSS_SELECTOR, 'span.end.date')
page_btns = driver.find_elements(By.CSS_SELECTOR, 'button.v-pagination__item')

try:
    clan = driver.find_element(
        By.CSS_SELECTOR, 'div.v-list-item__title.text--secondary.font-italic'
    ).get_attribute('innerText')
except Exception:
    # the element is missing when the player is currently in a clan
    clan = "something else"

if start_dates[0].get_attribute('innerText') == end_dates[0].get_attribute('innerText') or clan == "Not in any Clans":
    # basically the inactivity time can't really be seen from just the clan, so it only works on
    # players that don't have a clan. this is sometimes displayed weird by the website, thus this logic
    last_played = start_dates[1].get_attribute('innerText')
    # only start date is available when turning inactive
else:
    last_played = end_dates[0].get_attribute('innerText')  # in a normal case this is "today"

# see if there are more pages
if len(page_btns) > 0 and len(end_dates) > 0:
    page_btns[-1].click()  # click the last page_btn
    start_dates = driver.find_elements(By.CSS_SELECTOR, 'span.start.date')  # update start dates

first_played = start_dates[-1].get_attribute('innerText')  # get last start date

print(f"First time played| {first_played}")
print(f"Last time played| {last_played}")

driver.quit()  # else the window stays open and your program keeps running
Output:
First time played| Sep 8, 2020
Last time played| Oct 1, 2022
This should work standalone for any number of pages.
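If the goal is an approximation of how long the account was active, the two printed strings can be parsed into dates. This is a small sketch using only the standard library; `parse_site_date` and `days_between` are made-up helper names, and the `"%b %d, %Y"` format is an assumption based on the output shown above (it also assumes English month abbreviations in the current locale).

```python
from datetime import datetime

# Assumed format of the site's date text, e.g. "Sep 8, 2020"
SITE_FORMAT = "%b %d, %Y"


def parse_site_date(text):
    """Turn a string such as 'Sep 8, 2020' into a datetime.date.

    Raises ValueError if the text does not match SITE_FORMAT.
    """
    return datetime.strptime(text.strip(), SITE_FORMAT).date()


def days_between(first_played, last_played):
    """Number of days from the first to the last logged date."""
    return (parse_site_date(last_played) - parse_site_date(first_played)).days


print(days_between("Sep 8, 2020", "Oct 1, 2022"))  # 753
```

Parsing before any comma stripping keeps the format unambiguous; the comma-free variant used in the question can still be produced afterwards with `.replace(",", "")` on the original string.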
Answered By - Driftr95