Issue
I am trying to get the 3rd and 6th tables from a webpage (https://www.pro-football-reference.com/years/2021/) but pandas.read_html and BeautifulSoup are both only finding the first two tables on the page. Here is what I've tried.
url = 'https://www.pro-football-reference.com/years/2021/'
data_pd = pd.read_html(url)
print(len(data_pd))
Output:
2
and also
url = 'https://www.pro-football-reference.com/years/2021/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for table in soup.find_all('table'):
    print(table.get('class'))
Output:
['sortable', 'stats_table']
['sortable', 'stats_table']
I am guessing it has something to do with the way the webpage is formatted, but is there anything I can do to grab the tables that I need?
Solution
Yes, you could use Selenium to let the page render and then pull in the HTML. However, I try to avoid Selenium when I can, to avoid the overhead.
The better option is a simple request: the static HTML does contain the other tables, but they sit inside HTML comments. You can either a) use BeautifulSoup's ability to pull out the comments and then parse the tables inside them, or b) simply remove the comment tags and then parse.
import requests
import pandas as pd
url = 'https://www.pro-football-reference.com/years/2021/'
response = requests.get(url).text.replace("<!--","").replace("-->","")
data_pd = pd.read_html(response)
print(len(data_pd))
Output:
13
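For context, here is a minimal, self-contained illustration (with hypothetical markup, not the real page) of why this works: pd.read_html skips any table that sits inside an HTML comment, and stripping the comment markers exposes it.

```python
import pandas as pd
from io import StringIO

# Hypothetical snippet: one normal table plus one commented-out table,
# mimicking how the extra tables are embedded in the page source.
html = """
<table><tr><th>a</th></tr><tr><td>1</td></tr></table>
<!-- <table><tr><th>b</th></tr><tr><td>2</td></tr></table> -->
"""

print(len(pd.read_html(StringIO(html))))  # the commented table is ignored

stripped = html.replace("<!--", "").replace("-->", "")
print(len(pd.read_html(StringIO(stripped))))  # now both tables parse
```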
Or, using BeautifulSoup to go through the comments:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
url = 'https://www.pro-football-reference.com/years/2021/'
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')
comments = data.find_all(string=lambda text: isinstance(text, Comment))
data_pd = pd.read_html(result)  # the two tables outside the comments
for each in comments:
    if '<table' in str(each):
        data_pd.append(pd.read_html(str(each))[0])
print(len(data_pd))
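Either way, once the full list of DataFrames is built, the 3rd and 6th tables are just list indices away. A small sketch with stand-in DataFrames (the real list comes from the code above); note that Python lists are 0-indexed:

```python
import pandas as pd

# Hypothetical stand-in for the list returned by pd.read_html
tables = [pd.DataFrame({"col": [i]}) for i in range(13)]

# 0-based indexing: the 3rd and 6th tables are at positions 2 and 5
third_table = tables[2]
sixth_table = tables[5]
print(third_table["col"].iloc[0], sixth_table["col"].iloc[0])  # 2 5
```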
Answered By - chitown88