Issue
I'm trying to scrape the data from the top table on this page ("2021-2022 Regular Season Player Stats") using Python and BeautifulSoup. The page shows stats for 100 NHL players, one player per row. The code below works, but it only pulls the first ten rows into the dataframe. This is because every ten rows sit in a separate <tbody>, so the loop only iterates through the rows of the first <tbody>. How can I get it to continue through the rest of the <tbody> elements on the page?
Another question: the full table has about 1,000 rows, but the page only shows up to 100 at a time. Is there a way to rewrite the code below to iterate through the entire table instead of just the 100 rows shown on the page?
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.eliteprospects.com/league/nhl/stats/2021-2022'
source = requests.get(url).text
soup = BeautifulSoup(source, 'html.parser')

table = soup.find('table', class_='table table-striped table-sortable player-stats highlight-stats season')

df = pd.DataFrame(columns=['Player', 'Team', 'GamesPlayed', 'Goals', 'Assists', 'TotalPoints', 'PointsPerGame', 'PIM', 'PM'])

for row in table.tbody.find_all('tr'):
    columns = row.find_all('td')
    Player = columns[1].text.strip()
    Team = columns[2].text.strip()
    GamesPlayed = columns[3].text.strip()
    Goals = columns[4].text.strip()
    Assists = columns[5].text.strip()
    TotalPoints = columns[6].text.strip()
    PointsPerGame = columns[7].text.strip()
    PIM = columns[8].text.strip()
    PM = columns[9].text.strip()
    df = df.append({"Player": Player, "Team": Team, "GamesPlayed": GamesPlayed, "Goals": Goals, "Assists": Assists, "TotalPoints": TotalPoints, "PointsPerGame": PointsPerGame, "PIM": PIM, "PM": PM}, ignore_index=True)
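On the first question alone: calling find_all('tr') on the table itself, rather than on table.tbody, descends into every <tbody> on the page, so no pagination change is needed just to get past the first ten rows. A minimal sketch of the difference, using a toy table with the same multi-<tbody> structure:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Toy HTML mimicking the page: each <tbody> holds a batch of rows.
html = """
<table class="player-stats">
  <tbody><tr><td>1</td><td>Player A</td><td>Team A</td></tr></tbody>
  <tbody><tr><td>2</td><td>Player B</td><td>Team B</td></tr></tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="player-stats")

rows = []
# table.find_all('tr') collects rows from every <tbody>;
# table.tbody.find_all('tr') would stop after the first one.
for row in table.find_all("tr"):
    columns = row.find_all("td")
    rows.append({"Player": columns[1].text.strip(),
                 "Team": columns[2].text.strip()})

# Building the frame from a list of dicts also avoids the repeated
# df.append calls, which are deprecated in recent pandas.
df = pd.DataFrame(rows)
print(len(df))  # 2 — one row from each <tbody>
```

The same one-line change (table.find_all instead of table.tbody.find_all) applies directly to the code above.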
Solution
To load all player stats into a dataframe and save them to CSV, you can use the following example, which requests each of the ten result pages in turn:
import requests
import pandas as pd
from bs4 import BeautifulSoup

dfs = []
for page in range(1, 11):
    url = f"https://www.eliteprospects.com/league/nhl/stats/2021-2022?sort=tp&page={page}"
    print(f"Loading {url=}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    df = (
        pd.read_html(str(soup.select_one(".player-stats")))[0]
        .dropna(how="all")
        .reset_index(drop=True)
    )
    dfs.append(df)

df_final = pd.concat(dfs).reset_index(drop=True)
print(df_final)
df_final.to_csv("data.csv", index=False)
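Note that, as the output below shows, players who have not yet appeared in a game are listed with "-" placeholders, so the affected stat columns come back as strings. If you need them numeric, pd.to_numeric with errors="coerce" converts the dashes to NaN. A small sketch (the column names "GP" and "G" are illustrative; the real headers depend on what read_html extracts from the page):

```python
import pandas as pd

# Toy frame mimicking the scraped output: "-" marks players with no games.
df_final = pd.DataFrame({
    "Player": ["Austin Poganski (RW)", "Jack McBain (C)"],
    "GP": ["16", "-"],
    "G": ["0", "-"],
})

# Coerce the stat columns to numbers; "-" becomes NaN.
for col in ["GP", "G"]:
    df_final[col] = pd.to_numeric(df_final[col], errors="coerce")

print(df_final)
```

After the conversion, the stat columns are floats and methods like sum() or mean() skip the NaN rows automatically.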
Prints:
...
1132 973.0 Austin Poganski (RW) Winnipeg Jets 16 0 0 0 0.00 7 -3.0
1133 974.0 Mikhail Maltsev (LW) Colorado Avalanche 18 0 0 0 0.00 2 -5.0
1134 975.0 Mason Geertsen (D/LW) New Jersey Devils 23 0 0 0 0.00 62 -4.0
1135 976.0 Jack McBain (C) Arizona Coyotes - - - - - - NaN
1136 977.0 Jordan Harris (D) Montréal Canadiens - - - - - - NaN
1137 978.0 Nikolai Knyzhov (D) San Jose Sharks - - - - - - NaN
1138 979.0 Marc McLaughlin (C) Boston Bruins - - - - - - NaN
1139 980.0 Carson Meyer (RW) Columbus Blue Jackets - - - - - - NaN
1140 981.0 Leon Gawanke (D) Winnipeg Jets - - - - - - NaN
1141 982.0 Brady Keeper (D) Vancouver Canucks - - - - - - NaN
1142 983.0 Miles Wood (LW) New Jersey Devils - - - - - - NaN
1143 984.0 Samuel Morin (D/LW) Philadelphia Flyers - - - - - - NaN
1144 985.0 Connor Carrick (D) Seattle Kraken - - - - - - NaN
1145 986.0 Micheal Ferland (LW/RW) Vancouver Canucks - - - - - - NaN
1146 987.0 Jake Gardiner (D) Carolina Hurricanes - - - - - - NaN
1147 988.0 Oscar Klefbom (D) Edmonton Oilers - - - - - - NaN
1148 989.0 Shea Weber (D) Montréal Canadiens - - - - - - NaN
1149 990.0 Brandon Sutter (C/RW) Vancouver Canucks - - - - - - NaN
1150 991.0 Brent Seabrook (D) Tampa Bay Lightning - - - - - - NaN
and saves data.csv
Answered By - Andrej Kesely