Tuesday, January 18, 2022

[FIXED] Creating a table off of 247sports website

Tags: beautifulsoup, pandas, selenium

Issue

I am trying to build a pandas DataFrame of the top 1000 recruits in the 2022 football recruiting class from the 247sports website, working in a Google Colab notebook. This is the code I have so far:

#Importing all necessary packages
import pandas as pd
import time
import datetime as dt
import os
import re
import requests
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#import twofourseven

from bs4 import BeautifulSoup
from splinter import Browser
from kora.selenium import wd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import requests
from geopy.geocoders import Nominatim

from sklearn.model_selection import KFold
from sklearn.metrics import log_loss, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb

year = '2022'

url = 'https://247sports.com/Season/' + str(year) + '-Football/CompositeRecruitRankings?InstitutionGroup=HighSchool'

# Add the `user-agent` otherwise we will get blocked when sending the request
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}

response = requests.get(url, headers = headers).content
soup = BeautifulSoup(response, "html.parser")
data = []

for tag in soup.find_all("li", class_="rankings-page__list-item"):  # `[1:]` Since the first result is a table header
    # meta = tag.find_all("span", class_="meta")

    rank = tag.find_next("div", class_="primary").text
    TwoFourSeven_rank = tag.find_next("div", class_="other").text
    name = tag.find_next("a", class_="rankings-page__name-link").text

    school = tag.find_next("span", class_="meta").text
    position = tag.find_next("div", class_="position").text
    height_weight = tag.find_next("div", class_="metrics").text
    rating = tag.find_next("span", class_="score").text
    nat_rank = tag.find_next("a", class_="natrank").text
    state_rank = tag.find_next("a", class_="sttrank").text
    pos_rank = tag.find_next("a", class_="posrank").text

    data.append(
        {
            "Rank": rank,
            "247 Rank": TwoFourSeven_rank,
            "Name": name,
            "School": school,
            "Class of": year,
            "Position": position,
            "Height & Weight": height_weight,
            "Rating": rating,
            "National Rank": nat_rank,
            "State Rank": state_rank,
            "Position Rank": pos_rank,
#            "School": ???,
        }
    )

    print(rank)

df = pd.DataFrame(data)

data

Ideally, I would also like to grab the name of the school the recruit chose from the logo in the table, but I am not sure how to go about that. For example, I would like "Florida State" to appear in the school column for this "row" of data.

The ranks do print, but afterwards I get the following error, which prevents me from collecting or printing any further data:

AttributeError                            Traceback (most recent call last)
<ipython-input-11-56f4779601f8> in <module>()
     16     # meta = tag.find_all("span", class_="meta")
     17 
---> 18     rank = tag.find_next("div", class_="primary").text
     19     # TwoFourSeven_rank = tag.find_next("div", class_="other").text
     20     name = tag.find_next("a", class_="rankings-page__name-link").text

AttributeError: 'NoneType' object has no attribute 'text'

Lastly, I understand that this page only displays 50 recruits unless my Python code clicks the "Load More" button via Selenium, but I am not sure how to incorporate that in the most efficient and readable way. If anyone knows a good way to do all this, I'd greatly appreciate it. Thanks in advance.

Solution

Wrap the lookups in try/except, because some elements are missing from some rows. There is also no need for Selenium: the additional recruits can be fetched with plain requests by passing a Page query parameter.
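To illustrate why the original loop ends in an AttributeError, here is a minimal sketch. The HTML is a hypothetical, heavily simplified stand-in for the real page, and `text_or_blank` is a helper name made up for this example. It assumes the "Load More" item shares the list-item class but has no rank inside it, so a lookup there returns None. Note also that `find_next` searches the rest of the document after the tag, while `find` searches only inside the tag, which is why the sketch uses `find`.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one recruit row followed by a
# "Load More" item that has the same <li> class but no rank or logo.
html = """
<ul>
  <li class="rankings-page__list-item">
    <div class="rank-column"><div class="primary">1</div></div>
    <div class="recruit"><a href="/Player/Example-1">Example Player</a></div>
    <div class="status"><img title="Example State"></div>
  </li>
  <li class="rankings-page__list-item"><a href="#">Load More</a></li>
</ul>
"""


def text_or_blank(parent, name, class_name):
    """Return the stripped text of the first matching descendant, or ''."""
    node = parent.find(name, class_=class_name)
    return node.get_text(strip=True) if node is not None else ""


soup = BeautifulSoup(html, "html.parser")
rows = []
for item in soup.find_all("li", class_="rankings-page__list-item"):
    rank = text_or_blank(item, "div", "primary")
    if not rank:
        continue  # not a recruit row (here: the "Load More" item)
    try:
        team = item.find("div", class_="status").find("img")["title"]
    except (AttributeError, TypeError, KeyError):
        team = ""  # no status block, no logo, or no title on the image
    rows.append({"Rank": rank, "Team": team})

print(rows)
```

Catching only the exceptions a missing element can raise keeps unrelated bugs visible, which a bare `except:` would hide.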

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = 'https://247sports.com/Season/2022-Football/CompositeRecruitRankings/?InstitutionGroup=HighSchool'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'}

rows = []
page = 0
while True:
    page += 1
    print('Page: %s' % page)

    payload = {'Page': '%s' % page}
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    athletes = soup.find_all('li', {'class': 'rankings-page__list-item'})
    # An empty page means we have gone past the last page of results.
    if len(athletes) == 0:
        break

    for athlete in athletes:
        # Skip the 'Load More' button, which shares the list-item class.
        if athlete.text.strip() == 'Load More':
            continue

        rank_column = athlete.find('div', {'class': 'rank-column'})
        recruit = athlete.find('div', {'class': 'recruit'})
        metrics = athlete.find('div', {'class': 'metrics'}).text.split('/')

        primary_rank = rank_column.find('div', {'class': 'primary'}).text.strip()
        try:
            other_rank = rank_column.find('div', {'class': 'other'}).text.strip()
        except AttributeError:
            # Some rows have no second ranking.
            other_rank = ''
        name = recruit.find('a').text.strip()
        link = 'https://247sports.com' + recruit.find('a')['href']
        # The school text spans several lines in the markup; join them.
        highschool = ' '.join(x.strip() for x in recruit.find('span', {'class': 'meta'}).text.strip().split('\n'))
        pos = athlete.find('div', {'class': 'position'}).text.strip()
        ht = metrics[0].strip()
        wt = metrics[1].strip()
        rating = athlete.find('span', {'class': 'score'}).text.strip()
        nat_rank = athlete.find('a', {'class': 'natrank'}).text.strip()
        pos_rank = athlete.find('a', {'class': 'posrank'}).text.strip()
        st_rank = athlete.find('a', {'class': 'sttrank'}).text.strip()

        try:
            # The chosen school is the title of the logo image, if present.
            team = athlete.find('div', {'class': 'status'}).find('img')['title']
        except (AttributeError, TypeError, KeyError):
            team = ''

        row = {'Primary Rank': primary_rank,
               'Other Rank': other_rank,
               'Name': name,
               'Link': link,
               'Highschool': highschool,
               'Position': pos,
               'Height': ht,
               'weight': wt,
               'Rating': rating,
               'National Rank': nat_rank,
               'Position Rank': pos_rank,
               'State Rank': st_rank,
               'Team': team}

        rows.append(row)

df = pd.DataFrame(rows)
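
The paging idea in the loop above can be isolated into a small sketch that runs offline. The names `page_url`, `collect_all`, `fetch_rows`, and `max_pages` are made up for this example, and the safety cap is an added assumption rather than part of the answer. The download-and-parse step is passed in as a function so that the stopping rule (quit at the first empty page) can be checked without hitting the site.

```python
from urllib.parse import urlencode

BASE_URL = "https://247sports.com/Season/2022-Football/CompositeRecruitRankings/"


def page_url(page):
    """Build the URL for one page of the rankings."""
    query = urlencode({"InstitutionGroup": "HighSchool", "Page": page})
    return f"{BASE_URL}?{query}"


def collect_all(fetch_rows, max_pages=100):
    """Call fetch_rows(url) for pages 1, 2, ... until a page yields no rows.

    max_pages is a hypothetical safety cap so the loop always terminates.
    """
    all_rows = []
    for page in range(1, max_pages + 1):
        rows = fetch_rows(page_url(page))
        if not rows:
            break
        all_rows.extend(rows)
    return all_rows


# Stand-in for the real download-and-parse step: two pages of fake rows.
fake_site = {page_url(1): ["a", "b"], page_url(2): ["c"]}
result = collect_all(lambda url: fake_site.get(url, []))

print(page_url(2))
print(result)
```

In real use, `fetch_rows` would wrap the `requests.get` call and the BeautifulSoup parsing shown in the answer.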

Output (first 10 of 1321 rows):

print(df.head(10).to_string())
  Primary Rank Other Rank                    Name                                                          Link                                   Highschool Position Height weight  Rating National Rank Position Rank State Rank           Team
0            1          1             Quinn Ewers             https://247sports.com/Player/Quinn-Ewers-45572600          Southlake Carroll (Southlake, TX)       QB    6-3    206  1.0000             1             1          1     Ohio State
1            2          3           Travis Hunter           https://247sports.com/Player/Travis-Hunter-46084728                 Collins Hill (Suwanee, GA)       CB    6-1    165  0.9993             2             1          1  Florida State
2            3          2            Walter Nolen            https://247sports.com/Player/Walter-Nolen-46083769   St. Benedict at Auburndale (Cordova, TN)       DL    6-4    300  0.9991             3             1          1               
3            4         14          Domani Jackson          https://247sports.com/Player/Domani-Jackson-46057101                  Mater Dei (Santa Ana, CA)       CB    6-1    185  0.9966             4             2          1            USC
4            5         10               Zach Rice               https://247sports.com/Player/Zach-Rice-46086346  Liberty Christian Academy (Lynchburg, VA)       OT    6-6    282  0.9951             5             1          1               
5            6          4  Gabriel Brownlow-Dindy  https://247sports.com/Player/Gabriel-Brownlow-Dindy-46084792                    Lakeland (Lakeland, FL)       DL    6-3    275  0.9946             6             2          1               
6            7          5          Shemar Stewart          https://247sports.com/Player/Shemar-Stewart-46080267             Monsignor Pace (Opa Locka, FL)       DL    6-5    260  0.9946             7             3          2               
7            8         20           Denver Harris           https://247sports.com/Player/Denver-Harris-46081216                  North Shore (Houston, TX)       CB    6-1    180  0.9944             8             3          2               
8            9         33             Travis Shaw             https://247sports.com/Player/Travis-Shaw-46057330                  Grimsley (Greensboro, NC)       DL    6-5    310  0.9939             9             4          1               
9           10         23          Devon Campbell          https://247sports.com/Player/Devon-Campbell-46093947                      Bowie (Arlington, TX)      IOL    6-3    310  0.9937            10             1          3
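
Every value in the table above was scraped as text, so sorting or filtering by height, weight, or rating needs a conversion step first. The following is a minimal sketch of that step, assuming heights always look like 'feet-inches' and weights are in pounds. The helper `height_to_inches` and the output keys are names made up for this example.

```python
def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-3' to total inches."""
    feet, inches = height.split("-", 1)
    # float() tolerates fractional inches, should any row contain them.
    return int(feet) * 12 + float(inches)


# One row copied from the output above, still as raw strings.
row = {"Height": "6-3", "weight": "206", "Rating": "1.0000"}

cleaned = {
    "height_in": height_to_inches(row["Height"]),
    "weight_lb": int(row["weight"]),  # assumes pounds
    "rating": float(row["Rating"]),
}

print(cleaned)
```

Blank or malformed values would raise ValueError here, so a full clean-up pass would need to decide how to treat them.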

Answered By - chitown88

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
