Tuesday, November 9, 2021

[FIXED] Extract data from a specific page using Python Beautifulsoup


Issue

I'm very new to Python and BeautifulSoup. I wrote the code below to call up the website (https://www.fangraphs.com/depthcharts.aspx?position=Team), scrape the data in the table and export it to a CSV file. I was able to write code to extract data from other tables on the website, but not this particular one. It keeps coming back with: AttributeError: 'NoneType' object has no attribute 'find'. I've been racking my brain trying to figure out what I'm doing wrong. Do I have the wrong "class" name? Again, I'm very new and trying to teach myself. I have been learning via trial and error and by reverse engineering other people's code. This one has me stumped. Any guidance?

import requests
import csv
import datetime
from bs4 import BeautifulSoup

# static urls
season = datetime.datetime.now().year
URL = "https://www.fangraphs.com/depthcharts.aspx?position=Team".format(season=season)

# request the data
batting_html = requests.get(URL).text

def parse_array_from_fangraphs_html(input_html, out_file_name):
    """
    Take a HTML stats page from fangraphs and parse it out to a CSV file.
    """
    # parse input
    soup = BeautifulSoup(input_html, "lxml")
    table = soup.find("table", {"class": "tablesoreder, depth_chart tablesorter tablesorter-default"})

    # get headers
    headers_html = table.find("thead").find_all("th")
    headers = []
    for header in headers_html:
        headers.append(header.text)
    print(headers)

    # get rows
    rows = []
    rows_html = table.find("tbody").find_all("tr")
    for row in rows_html:
        row_data = []
        for cell in row.find_all("td"):
            row_data.append(cell.text)
        rows.append(row_data)

    # write to CSV file
    with open(out_file_name, "w") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(headers)
        writer.writerows(rows)

parse_array_from_fangraphs_html(batting_html, 'Team War Totals.csv')

Solution

The traceback looks like this:

AttributeError                            Traceback (most recent call last)
<ipython-input-4-ee944e08f675> in <module>()
     41         writer.writerows(rows)
     42 
---> 43 parse_array_from_fangraphs_html(batting_html, 'Team War Totals.csv')

<ipython-input-4-ee944e08f675> in parse_array_from_fangraphs_html(input_html, out_file_name)
     20 
     21     # get headers
---> 22     headers_html = table.find("thead").find_all("th")
     23     headers = []
     24     for header in headers_html:

AttributeError: 'NoneType' object has no attribute 'find'

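So yes, the problem is in this statement:

table = soup.find("table", {"class": "tablesoreder, depth_chart tablesorter tablesorter-default"})

soup.find returns None when no element matches, and a class value containing spaces is compared against the element's whole class attribute as a single string. No table on the parsed page has exactly that attribute, so table is None and the next line, table.find("thead"), raises the AttributeError.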
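To make this kind of failure easier to spot, here is a minimal sketch on a made-up HTML snippet. The markup and the find_table_or_fail helper are hypothetical and not part of the original script. It only illustrates that find returns None for a selector that matches nothing, and that checking the result right away gives a clearer message than the AttributeError one line later:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; this is not the real FanGraphs page.
SAMPLE_HTML = '<table class="depth_chart"><thead><tr><th>Team</th></tr></thead></table>'


def find_table_or_fail(soup, class_name):
    """Hypothetical helper: return the first matching table or raise a clear error."""
    table = soup.find("table", class_=class_name)
    if table is None:
        raise ValueError(f"no <table> with class {class_name!r} found")
    return table


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A selector that matches nothing returns None instead of raising.
missing = soup.find("table", class_="no_such_class")
print(missing)

# The guarded lookup succeeds for a class that is present.
table = find_table_or_fail(soup, "depth_chart")
print(table.find("th").text)
```

The sketch uses the built-in "html.parser" so it runs without lxml; the original script could keep "lxml".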

You could fix it by passing the class names as separate values instead of one space-separated string, as suggested by another user. But then you would hit a second failure of the same kind: the parsed table has no tbody, so table.find("tbody") also returns None.
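Both points can be tried offline. The sketch below uses made-up markup and assumes, for illustration only, that the HTML the server sends carries just two of the four class names and has no tbody; the real page may differ:

```python
from bs4 import BeautifulSoup

# Made-up markup with placeholder values; assumed shape, not the real page.
SAMPLE_HTML = """
<table class="tablesoreder, depth_chart">
  <thead><tr><th>Team</th><th>WAR</th></tr></thead>
  <tr><td>Team A</td><td>1.0</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A string with spaces is compared against the whole class attribute.
exact_wrong = soup.find(
    "table", class_="tablesoreder, depth_chart tablesorter tablesorter-default"
)

# A single class name matches any element that carries that class.
by_one_class = soup.find("table", class_="depth_chart")

# A list matches elements that carry at least one of the listed classes.
by_any_class = soup.find("table", class_=["depth_chart", "tablesorter"])

# This table has no <tbody>, so find("tbody") returns None...
body = by_one_class.find("tbody")

# ...but the rows are still reachable directly from the table.
rows = by_one_class.find_all("tr")

print(exact_wrong, body, len(rows))
```

Searching for one distinctive class such as depth_chart is usually enough, and it keeps working if extra classes are added to the element later.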

The fixed script would look like this:

import requests
import csv
import datetime
from bs4 import BeautifulSoup

# static url (it has no {season} placeholder, so no .format() call is needed)
URL = "https://www.fangraphs.com/depthcharts.aspx?position=Team"

# request the data
batting_html = requests.get(URL).text

def parse_array_from_fangraphs_html(input_html, out_file_name):
    """
    Take a HTML stats page from fangraphs and parse it out to a CSV file.
    """
    # parse input
    soup = BeautifulSoup(input_html, "lxml")
    table = soup.find("table", class_=["tablesoreder,", "depth_chart", "tablesorter", "tablesorter-default"])

    # get headers
    headers_html = table.find("thead").find_all("th")
    headers = []
    for header in headers_html:
        headers.append(header.text)
    print(headers)

    # get rows
    rows = []
    rows_html = table.find_all("tr")
    for row in rows_html:
        row_data = []
        for cell in row.find_all("td"):
            row_data.append(cell.text)
        # header rows contain only <th> cells; skip rows without data cells
        if row_data:
            rows.append(row_data)

    # write to CSV file
    with open(out_file_name, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(headers)
        writer.writerows(rows)

parse_array_from_fangraphs_html(batting_html, 'Team War Totals.csv')
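Because the script depends on the live page, it can help to check the parsing logic against a small inline table first. The following is only a sketch under stated assumptions: the markup and values are placeholders, and table_to_rows is a hypothetical helper, not part of the answer above. It writes the CSV to an in-memory buffer so nothing touches the disk:

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up table for illustration; the values are placeholders.
SAMPLE_HTML = """
<table class="depth_chart">
  <thead><tr><th>Team</th><th>WAR</th></tr></thead>
  <tr><td>Team A</td><td>1.0</td></tr>
  <tr><td>Team B</td><td>2.0</td></tr>
</table>
"""


def table_to_rows(html, class_name):
    """Hypothetical helper: return (headers, rows) for the first matching table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=class_name)
    if table is None:
        raise ValueError(f"no <table> with class {class_name!r} found")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has only <th> cells, so it is skipped
            rows.append(cells)
    return headers, rows


headers, rows = table_to_rows(SAMPLE_HTML, "depth_chart")

# Write to an in-memory buffer; with a real file, open it with newline="".
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerows(rows)
print(buffer.getvalue())
```

Once the output looks right on the sample, the same helper could be fed the text returned by requests.get(URL).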

Answered By - nilleb

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
