Issue
I am trying to scrape a table from https://www.baseball-reference.com/register/team.cgi?id=9995d2a1, specifically the one labeled "Team Pitching". It is hidden inside an HTML comment, which prevents me from using pd.read_html() or another simple method directly. I have gotten to the point where I have all of the data in a data frame, but players with an asterisk in their name (marking them as left-handed) disappear: their names turn into 'None'. What I really need is to remove the '*' so that the name reads correctly.
This is what I did to get what I have so far, with 'None' as the name for lefties:
import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

page = BeautifulSoup(requests.get('https://www.baseball-reference.com/register/team.cgi?id=b0a9f9bc').text, features='lxml')

# collect every table that is hidden inside an HTML comment
tbls = []
for comment in page.find_all(text=lambda text: isinstance(text, Comment)):
    if comment.find("<table ") > 0:
        comment_soup = BeautifulSoup(comment, 'lxml')
        table = comment_soup.find("table")
        tbls.append(table)

def parse_row(row):
    return [str(x.string) for x in row.find_all('td')]

# pitching table
pitching_tbl = tbls[0]
# html text only used for finding names
html = BeautifulSoup(pitching_tbl.text, features='lxml')
rows = pitching_tbl.find_all('tr')
data = pd.DataFrame([parse_row(row) for row in rows])
What I would like to do is loop through the text within pitching_tbl, remove any asterisk with .replace('*', ''), and have the actual HTML within pitching_tbl changed in place.
Any help is appreciated!
Solution
The desired table is inside an HTML comment, so you can use BeautifulSoup's Comment class with a lambda function to find the comment that contains the table and pass its markup to pd.read_html().
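from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.baseball-reference.com/register/team.cgi?id=9995d2a1'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')

# keep only the HTML comments that contain the Team Pitching table
comments = [x for x in soup.find_all(string=lambda text: isinstance(text, Comment))
            if 'id="div_team_pitching"' in x]

# wrap the markup in StringIO: recent pandas versions deprecate passing literal HTML to read_html
df = pd.read_html(StringIO(str(comments[0])))[0]
print(df)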
Output:
      Rk                      Name   Age  W  L   W-L%  ...    H9   HR9   BB9   SO9  SO/W  Notes
0    1.0  Logan Bursick-Harrington  21.0  0  2  0.000  ...   4.5   0.0  15.8  15.8  1.00    NaN
1    2.0                Cylis Cox*  19.0  1  0  1.000  ...  23.1   0.0   7.7  11.6  1.50    NaN
2    3.0          Travis Densmore*  21.0  0  1  0.000  ...   7.2   0.0   1.8  14.4  8.00    NaN
3    4.0             Dylan Freeman  22.0  1  0  1.000  ...  13.5   1.1   3.4  14.6  4.33    NaN
4    5.0              Zach Hopman*  22.0  0  1  0.000  ...  12.8   0.0   9.9  11.4  1.14    NaN
5    6.0            Eamon Horwedel  22.0  1  0  1.000  ...   9.0   0.0   6.4   6.4  1.00    NaN
6    7.0             Tyler Johnson  19.0  0  0    NaN  ...   5.4   0.0   2.7  10.8  4.00    NaN
7    8.0               Trent Jones  20.0  0  0    NaN  ...  14.6   1.1   2.3  12.4  5.50    NaN
8    9.0              Tanner Knapp  21.0  1  1  0.500  ...  11.6   0.0   7.7   4.8  0.63    NaN
9   10.0              Mason Majors  22.0  1  0  1.000  ...   4.9   0.0   7.4  12.3  1.67    NaN
10  11.0               Mason Meeks  21.0  0  1  0.000  ...   6.3   0.9   3.6   5.4  1.50    NaN
11  12.0            Sam Nagelvoort  19.0  0  1  0.000  ...  18.0   2.3  22.5   9.0  0.40    NaN
12  13.0              Tyler Nichol  20.0  0  0    NaN  ...  27.0   0.0  27.0   0.0  0.00    NaN
13  14.0                Cole Russo  19.0  0  0    NaN  ...  27.0  13.5   0.0   0.0   NaN    NaN
14  15.0              Kyle Salley*  22.0  0  1  0.000  ...   9.0   2.3  22.5   9.0  0.40    NaN
15  16.0               Noah Stants  21.0  0  0    NaN  ...   4.3   1.4   7.1  11.4  1.60    NaN
16  17.0         Quinn Waterhouse*  21.0  0  0    NaN  ...   4.5   0.0   4.5  18.0  4.00    NaN
17  18.0              Nick Weyrich  19.0  0  0    NaN  ...   6.4   1.3   7.7  11.6  1.50    NaN
18  19.0              Adam Wheaton  23.0  0  1  0.000  ...  11.7   1.8   4.5  12.6  2.80    NaN
19   NaN                19 Players  20.9  5  9  0.357  ...   9.2   0.8   6.9  10.7  1.55    NaN
[20 rows x 32 columns]
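The names in the output above still carry the '*' marker for left-handers. Below is a minimal sketch of one way to strip it once the data is in a DataFrame. It is not part of the original answer: it uses a small hypothetical stand-in for the scraped table (a few names copied from the output above), and the Throws_Left column is an illustrative addition.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table; names copied from the output above.
df = pd.DataFrame({
    "Name": ["Logan Bursick-Harrington", "Cylis Cox*", "Travis Densmore*", "19 Players"],
    "Age": [21.0, 19.0, 21.0, 20.9],
})

# Optionally record handedness before discarding the marker
# (Throws_Left is an illustrative column name, not from the source table).
df["Throws_Left"] = df["Name"].str.endswith("*")

# regex=False treats "*" as a literal character rather than a regex quantifier.
df["Name"] = df["Name"].str.replace("*", "", regex=False).str.strip()

print(df)
```

As for the 'None' names in the question's parse_row: in BeautifulSoup, .string returns None when a tag has more than one child node, which is presumably the case for a name cell that holds a link followed by a literal '*'. Using x.get_text() instead of x.string returns the combined text of the cell, after which .replace('*', '') works as intended.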
Answered By - F.Hoque