Issue
I am scraping the FBref website to get specific player info that I can use for further analysis.
I have selected the table I want to scrape. The information I want is in <tr> tags without any class attribute.
The issue is that this table also has many header rows in <tr> tags that do have a class name.
import requests
from bs4 import BeautifulSoup
from time import sleep
url = "https://fbref.com/en/comps/9/2021-2022/stats/2021-2022-Premier-League-Stats"
response = requests.get(url).text.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(response, "html.parser")
I have selected the table I want to scrape. I want to select only the <tr> tags that don't have a class attribute, because that's where the information I want is located.
players_table = soup.select("table#stats_standard tbody tr", class_ =None)
I have then looped through the players_table so that I can get each player's info like name, country, position, etc.
for player in players_table:
    player_name = player.find("td", attrs={"data-stat" : "player"}).a.text
    print(player_name)
    sleep(2)
But now the problem is that when the loop reaches a <tr class="thead"> row, it looks for that row's <a> tag and then for the text inside it. This row doesn't have any <a> tags (find() returns None for it), so my code breaks with the error message 'NoneType' object has no attribute 'a' when I run it.
My code prints the names of the players until it reaches this <tr class="thead"> row with no <a> tag, and then it fails.
I have even tried to decompose or clear this <tr class="thead"> tag, but it still doesn't work.
player.find(".thead").decompose()
So my question is: how can I select only the <tr> tags that don't have any class, so that when my loop reaches a header row it just skips it? I have tried doing that by passing class_=None when selecting the rows:
players_table = soup.select("table#stats_standard tbody tr", class_ =None)
But this didn't solve anything. I need your help on this, please.
Solution
If you only want to exclude the subheaders, adjust your selector so that it only selects <tr> elements without the class thead:
soup.select('table#stats_standard tbody tr:not(.thead)')
Or, more specific to the title of your question, select only the rows that do not have a class attribute at all:
soup.select('table#stats_standard tbody tr:not([class])')
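To see how the two selectors behave without hitting the site, here is a minimal offline sketch. The HTML snippet is a hypothetical miniature of the real table (the player names and row layout are made up for illustration), so treat it as a demonstration of the selectors rather than of FBref's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the stats table: two player rows and one
# repeated header row that carries class="thead".
html = """
<table id="stats_standard">
  <tbody>
    <tr><td data-stat="player"><a href="#">Player One</a></td></tr>
    <tr class="thead"><th data-stat="player">Player</th></tr>
    <tr><td data-stat="player"><a href="#">Player Two</a></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Every row, including the repeated header row.
all_rows = soup.select("table#stats_standard tbody tr")
# Rows that do not have the class "thead".
no_thead = soup.select("table#stats_standard tbody tr:not(.thead)")
# Rows that have no class attribute at all.
no_class = soup.select("table#stats_standard tbody tr:not([class])")

names = [row.find("td", attrs={"data-stat": "player"}).a.text for row in no_class]

print(len(all_rows), len(no_thead), len(no_class))
print(names)
```

In this sample both :not() variants return the same two rows. They would differ only if some data rows carried a class other than thead, in which case tr:not(.thead) keeps them and tr:not([class]) drops them.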
Example
import requests
from bs4 import BeautifulSoup
from time import sleep
url = "https://fbref.com/en/comps/9/2021-2022/stats/2021-2022-Premier-League-Stats"
response = requests.get(url).text.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(response, "html.parser")
for player in soup.select('table#stats_standard tbody tr:not([class])'):
    player_name = player.find("td", attrs={"data-stat" : "player"}).a.text
    print(player_name)
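As a complementary, more defensive variant, you could also keep the broad selector and skip any row where the player cell or its link is missing. This is only a sketch: extract_player_names is a hypothetical helper name, and the sample HTML is made up to stand in for the real page.

```python
from bs4 import BeautifulSoup


def extract_player_names(soup):
    """Return player names, skipping rows that lack a linked player cell."""
    names = []
    for row in soup.select("table#stats_standard tbody tr"):
        cell = row.find("td", attrs={"data-stat": "player"})
        # In this sample the header row uses <th> cells, so find() returns
        # None for it; checking first avoids the AttributeError on `.a`.
        if cell is None or cell.a is None:
            continue
        names.append(cell.a.get_text(strip=True))
    return names


# Hypothetical sample markup, standing in for the downloaded page.
sample_html = """
<table id="stats_standard">
  <tbody>
    <tr><td data-stat="player"><a href="#">Player One</a></td></tr>
    <tr class="thead"><th data-stat="player">Player</th></tr>
    <tr><td data-stat="player"><a href="#">Player Two</a></td></tr>
  </tbody>
</table>
"""

sample = BeautifulSoup(sample_html, "html.parser")
print(extract_player_names(sample))
```

The guard costs one extra check per row, but the loop no longer depends on every matched row having the same structure.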
Answered By - HedgeHog