Issue
I'm trying to scrape the data from the top table on this page ("2021-2022 Regular Season Player Stats") using Python and BeautifulSoup. The page shows stats for 100 NHL players, one player per row. The code below works, but it only pulls the first ten rows into the dataframe. This is because every ten rows sit in a separate <tbody>, so the loop only iterates through the rows of the first <tbody>. How can I get it to continue through the rest of the <tbody> elements on the page?
Another question: the full table has about 1,000 rows, but the page only shows up to 100 at a time. Is there a way to rewrite the code below to iterate through the entire table instead of just the 100 rows shown on the page?
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.eliteprospects.com/league/nhl/stats/2021-2022'
source = requests.get(url).text
soup = BeautifulSoup(source, 'html.parser')

table = soup.find('table', class_='table table-striped table-sortable player-stats highlight-stats season')

df = pd.DataFrame(columns=['Player', 'Team', 'GamesPlayed', 'Goals', 'Assists', 'TotalPoints', 'PointsPerGame', 'PIM', 'PM'])

for row in table.tbody.find_all('tr'):
    columns = row.find_all('td')
    Player = columns[1].text.strip()
    Team = columns[2].text.strip()
    GamesPlayed = columns[3].text.strip()
    Goals = columns[4].text.strip()
    Assists = columns[5].text.strip()
    TotalPoints = columns[6].text.strip()
    PointsPerGame = columns[7].text.strip()
    PIM = columns[8].text.strip()
    PM = columns[9].text.strip()
    df = df.append({"Player": Player, "Team": Team, "GamesPlayed": GamesPlayed, "Goals": Goals, "Assists": Assists, "TotalPoints": TotalPoints, "PointsPerGame": PointsPerGame, "PIM": PIM, "PM": PM}, ignore_index=True)
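On the first question alone: calling find_all('tr') on the table itself, rather than on table.tbody, descends into every <tbody> on the page, so no pagination change is needed just to get past the first ten rows. A minimal sketch of the difference, using a toy table with the same multi-<tbody> structure:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Toy HTML mimicking the page: each <tbody> holds a batch of rows.
html = """
<table class="player-stats">
  <tbody><tr><td>1</td><td>Player A</td><td>Team A</td></tr></tbody>
  <tbody><tr><td>2</td><td>Player B</td><td>Team B</td></tr></tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="player-stats")

rows = []
# table.find_all('tr') collects rows from every <tbody>;
# table.tbody.find_all('tr') would stop after the first one.
for row in table.find_all("tr"):
    columns = row.find_all("td")
    rows.append({"Player": columns[1].text.strip(),
                 "Team": columns[2].text.strip()})

# Building the frame from a list of dicts also avoids the repeated
# df.append calls, which are deprecated in recent pandas.
df = pd.DataFrame(rows)
print(len(df))  # 2 — one row from each <tbody>
```

The same one-line change (table.find_all instead of table.tbody.find_all) applies directly to the code above.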
Solution
To load all player stats into a dataframe and save them to CSV, you can use the following example, which requests each of the ten result pages in turn:
import requests
import pandas as pd
from bs4 import BeautifulSoup

dfs = []
for page in range(1, 11):
    url = f"https://www.eliteprospects.com/league/nhl/stats/2021-2022?sort=tp&page={page}"
    print(f"Loading {url=}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    df = (
        pd.read_html(str(soup.select_one(".player-stats")))[0]
        .dropna(how="all")
        .reset_index(drop=True)
    )
    dfs.append(df)

df_final = pd.concat(dfs).reset_index(drop=True)
print(df_final)
df_final.to_csv("data.csv", index=False)
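Note that, as the output below shows, players who have not yet appeared in a game are listed with "-" placeholders, so the affected stat columns come back as strings. If you need them numeric, pd.to_numeric with errors="coerce" converts the dashes to NaN. A small sketch (the column names "GP" and "G" are illustrative; the real headers depend on what read_html extracts from the page):

```python
import pandas as pd

# Toy frame mimicking the scraped output: "-" marks players with no games.
df_final = pd.DataFrame({
    "Player": ["Austin Poganski (RW)", "Jack McBain (C)"],
    "GP": ["16", "-"],
    "G": ["0", "-"],
})

# Coerce the stat columns to numbers; "-" becomes NaN.
for col in ["GP", "G"]:
    df_final[col] = pd.to_numeric(df_final[col], errors="coerce")

print(df_final)
```

After the conversion, the stat columns are floats and methods like sum() or mean() skip the NaN rows automatically.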
Prints:
...
1132 973.0 Austin Poganski (RW) Winnipeg Jets 16 0 0 0 0.00 7 -3.0
1133 974.0 Mikhail Maltsev (LW) Colorado Avalanche 18 0 0 0 0.00 2 -5.0
1134 975.0 Mason Geertsen (D/LW) New Jersey Devils 23 0 0 0 0.00 62 -4.0
1135 976.0 Jack McBain (C) Arizona Coyotes - - - - - - NaN
1136 977.0 Jordan Harris (D) Montréal Canadiens - - - - - - NaN
1137 978.0 Nikolai Knyzhov (D) San Jose Sharks - - - - - - NaN
1138 979.0 Marc McLaughlin (C) Boston Bruins - - - - - - NaN
1139 980.0 Carson Meyer (RW) Columbus Blue Jackets - - - - - - NaN
1140 981.0 Leon Gawanke (D) Winnipeg Jets - - - - - - NaN
1141 982.0 Brady Keeper (D) Vancouver Canucks - - - - - - NaN
1142 983.0 Miles Wood (LW) New Jersey Devils - - - - - - NaN
1143 984.0 Samuel Morin (D/LW) Philadelphia Flyers - - - - - - NaN
1144 985.0 Connor Carrick (D) Seattle Kraken - - - - - - NaN
1145 986.0 Micheal Ferland (LW/RW) Vancouver Canucks - - - - - - NaN
1146 987.0 Jake Gardiner (D) Carolina Hurricanes - - - - - - NaN
1147 988.0 Oscar Klefbom (D) Edmonton Oilers - - - - - - NaN
1148 989.0 Shea Weber (D) Montréal Canadiens - - - - - - NaN
1149 990.0 Brandon Sutter (C/RW) Vancouver Canucks - - - - - - NaN
1150 991.0 Brent Seabrook (D) Tampa Bay Lightning - - - - - - NaN
and saves data.csv
Answered By - Andrej Kesely