Issue
I am trying to get the 3rd and 6th tables from a webpage (https://www.pro-football-reference.com/years/2021/) but pandas.read_html and BeautifulSoup are both only finding the first two tables on the page. Here is what I've tried.
url = 'https://www.pro-football-reference.com/years/2021/'
data_pd = pd.read_html(url)
print(len(data_pd))
Output:
2
and also
url = 'https://www.pro-football-reference.com/years/2021/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for table in soup.find_all('table'):
    print(table.get('class'))
Output:
['sortable', 'stats_table']
['sortable', 'stats_table']
I am guessing it has something to do with the way the webpage is formatted, but is there anything I can do to grab the tables that I need?
Solution
Yes, you could use Selenium to let the page render and then pull in the HTML. However, I try to avoid Selenium when I can, to avoid the overhead.
The better option is a simple request: the static HTML does contain the other tables, but they sit inside HTML comments. You can either a) use BeautifulSoup's ability to pull out the comments and then parse the tables inside them, or b) simply remove the comment tags and then parse.
import requests
import pandas as pd
url = 'https://www.pro-football-reference.com/years/2021/'
response = requests.get(url).text.replace("<!--","").replace("-->","")
data_pd = pd.read_html(response)
print(len(data_pd))
Output:
13
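For context, here is a minimal, self-contained illustration (with hypothetical markup, not the real page) of why this works: pd.read_html skips any table that sits inside an HTML comment, and stripping the comment markers exposes it.

```python
import pandas as pd
from io import StringIO

# Hypothetical snippet: one normal table plus one commented-out table,
# mimicking how the extra tables are embedded in the page source.
html = """
<table><tr><th>a</th></tr><tr><td>1</td></tr></table>
<!-- <table><tr><th>b</th></tr><tr><td>2</td></tr></table> -->
"""

print(len(pd.read_html(StringIO(html))))  # the commented table is ignored

stripped = html.replace("<!--", "").replace("-->", "")
print(len(pd.read_html(StringIO(stripped))))  # now both tables parse
```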
Or, using BeautifulSoup to go through the comments:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
url = 'https://www.pro-football-reference.com/years/2021/'
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')
comments = data.find_all(string=lambda text: isinstance(text, Comment))
data_pd = pd.read_html(result)  # the two tables outside the comments
for each in comments:
    if '<table' in str(each):
        data_pd.append(pd.read_html(str(each))[0])
print(len(data_pd))
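Either way, once the full list of DataFrames is built, the 3rd and 6th tables are just list indices away. A small sketch with stand-in DataFrames (the real list comes from the code above); note that Python lists are 0-indexed:

```python
import pandas as pd

# Hypothetical stand-in for the list returned by pd.read_html
tables = [pd.DataFrame({"col": [i]}) for i in range(13)]

# 0-based indexing: the 3rd and 6th tables are at positions 2 and 5
third_table = tables[2]
sixth_table = tables[5]
print(third_table["col"].iloc[0], sixth_table["col"].iloc[0])  # 2 5
```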
Answered By - chitown88