Saturday, June 4, 2022

[FIXED] Web scrape attributes that are not always included in the tag Python Beautifulsoup


Issue

I am trying to scrape the URL 'https://www.pro-football-reference.com/teams/nwe/2013_injuries.htm' using Beautiful Soup. I want to scrape each player's name, their injury, and the week of the injury.

The player's name is straightforward to scrape: it is the text of a <th> tag and is always present. The week is the ["data-stat"] attribute of a <td> tag and is also always present. The injury is the ["data-tip"] attribute of that same <td> tag, but it is only present when the player has an injury.

I tried using an if/else statement for the injury status: if the <td> tag contains an injury, print the ["data-tip"] value, and if not, print "NA". My code prints the name, injury, and week for the first two players. The third player's <td> tag has no ["data-tip"] attribute, so the code breaks at that point and only the first two players are printed:

[['Danny Amendola'], 'Questionable: hamstring', 'week_1']
[['Armond Armstead'], 'Out: infection', 'week_1']

The output above is all my code prints before it fails with a KeyError. Here is the code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.pro-football-reference.com/teams/nwe/2013_injuries.htm'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

containers = page_soup.find("tbody")

player = containers.find_all("tr")
for tr in player:
    th = tr.find_all("th")
    name = [i.text for i in th]

    week = tr.td["data-stat"]

    injury = tr.td["data-tip"]
    if injury is None:
        injury = "NA"
        print([name, injury, week])
    else:
        print([name, injury, week])
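To make the failure in the loop above concrete, here is a minimal offline sketch. It assumes a made-up two-row HTML fragment (the player names and markup are hypothetical stand-ins for the real page) and compares subscript access, which raises KeyError for a missing attribute, with `Tag.get`, which returns None. This is why the `if injury is None` branch is never reached.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the table's markup; not taken from the real page.
html = """
<table><tbody>
<tr><th>Player A</th><td data-stat="week_1" data-tip="Questionable: hamstring"></td></tr>
<tr><th>Player B</th><td data-stat="week_1"></td></tr>
</tbody></table>
"""

rows = BeautifulSoup(html, "html.parser").find("tbody").find_all("tr")

results = []
for tr in rows:
    td = tr.td
    # Subscript access behaves like a dict lookup: a missing attribute raises KeyError.
    try:
        td["data-tip"]
        subscript_result = "ok"
    except KeyError:
        subscript_result = "KeyError"
    # .get() returns None instead of raising when the attribute is missing.
    results.append((tr.th.text, subscript_result, td.get("data-tip")))

print(results)
```

With this fragment, the second row triggers the KeyError on subscript access, while `td.get("data-tip")` simply yields None.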

The outcome I am looking for is for the code to print the player's name, the injury (or "NA" if there is no injury), and the week for every player in the table. For example, the third player in the table does not have an injury in week 1, so his injury should print as "NA":

[['Danny Amendola'], 'Questionable: hamstring', 'week_1']
[['Armond Armstead'], 'Out: infection', 'week_1']
[['Kyle Arrington'], 'NA', 'week_1']

The list should continue on for the rest of the players like this.

Solution

I'm piggybacking on Jack Moody's solution and only adding the remaining weeks, so the output covers every week column:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.pro-football-reference.com/teams/nwe/2013_injuries.htm'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

containers = page_soup.find("tbody")
head = page_soup.find("thead")


player = containers.find_all("tr")

weeks = head.find_all('th')
week_list = [i['data-stat'] for i in weeks][1:]

for week in week_list:
    for tr in player:
        th = tr.find_all("th")
        name = [i.text for i in th]

        # the cell for this week's column in the current player's row
        td = tr.find('td', {'data-stat': week})

        # "data-tip" is only present when the player has an injury that week
        try:
            injury = td["data-tip"]
        except KeyError:
            injury = "NA"
        print([name, injury, week])
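As a variation on the try/except above, the same lookup can be written with `Tag.get` and a default value. The sketch below is not part of the original answer. It runs against a made-up table (`SAMPLE_HTML`, the player names, and the helper name `extract_injuries` are all hypothetical) so it can be tried without fetching the page, and it groups the output by player rather than by week.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real table; structure assumed from the question.
SAMPLE_HTML = """
<table>
<thead><tr>
<th data-stat="player">Player</th>
<th data-stat="week_1">Week 1</th>
<th data-stat="week_2">Week 2</th>
</tr></thead>
<tbody>
<tr><th data-stat="player">Player A</th>
<td data-stat="week_1" data-tip="Questionable: hamstring"></td>
<td data-stat="week_2"></td></tr>
<tr><th data-stat="player">Player B</th>
<td data-stat="week_1"></td>
<td data-stat="week_2" data-tip="Out: illness"></td></tr>
</tbody>
</table>
"""


def extract_injuries(html):
    """Return [name, injury, week] for every player/week pair (hypothetical helper)."""
    page_soup = BeautifulSoup(html, "html.parser")
    # Skip the first header cell, which labels the player column.
    weeks = [th["data-stat"] for th in page_soup.find("thead").find_all("th")][1:]

    records = []
    for tr in page_soup.find("tbody").find_all("tr"):
        name = tr.th.get_text(strip=True)
        for week in weeks:
            td = tr.find("td", {"data-stat": week})
            # .get() with a default replaces the try/except KeyError;
            # also guard against a row that has no cell for this week.
            injury = td.get("data-tip", "NA") if td is not None else "NA"
            records.append([name, injury, week])
    return records


records = extract_injuries(SAMPLE_HTML)
for record in records:
    print(record)
```

Passing the real `page_html` to the helper instead of `SAMPLE_HTML` should work the same way, provided the live page keeps the `thead`/`tbody` layout assumed here.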

Answered By - chitown88

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
