Issue
I am using BeautifulSoup to scrape website data. I am getting a handle on how to scrape things that are displayed on the webpage. However, there is a unique identifier embedded in the HTML that I want to grab, and it is stored as an attribute value rather than as visible text. For example:
<tbody><tr ><th scope="row" class="right " data-stat="ranker" csk="1" >1</th><td class="left " data-stat="pos" csk="1" ><strong>C</strong></td><td class="left " data-append-csv="mccanja02" data-stat="player" csk="McCann,James" ><strong><a href="/players/m/mccanja02.shtml">James McCann</a></strong></td><td class="right " data-stat="age" >32</td><td class="right " data-stat="G" >13</td><td class="right " data-stat="PA" >42</td><td class="right " data-stat="AB" >36</td><td class="right " data-stat="R" >5</td><td class="right " data-stat="H" >7</td><td class="right " data-stat="2B" >2</td><td class="right iz" data-stat="3B" >0</td><td class="right " data-stat="HR" >1</td><td class="right " data-stat="RBI" >5</td><td class="right " data-stat="SB" >1</td><td class="right iz" data-stat="CS" >0</td><td class="right " data-stat="BB" >2</td><td class="right " data-stat="SO" >7</td><td class="right " data-stat="batting_avg" >.194</td><td class="right " data-stat="onbase_perc" >.286</td><td class="right " data-stat="slugging_perc" >.333</td><td class="right " data-stat="onbase_plus_slugging" >.619</td><td class="right " data-stat="onbase_plus_slugging_plus" >87</td><td class="right " data-stat="TB" >12</td><td class="right " data-stat="GIDP" >1</td><td class="right " data-stat="HBP" >3</td><td class="right iz" data-stat="SH" >0</td><td class="right " data-stat="SF" >1</td><td class="right iz" data-stat="IBB" >0</td></tr>
I want to grab just "mccanja02" because it can be appended to a URL that leads to the player's specific page. So far I've tried something like this:
# grab players UID
rowsUID = tableTeamBatting.find_all('tr')
for rowUID in rowsUID:
    playerUID = rowUID.find('td', {'data-append-csv'})
    if playerUID:
        playerUID = playerUID.text
        print(playerUID)
But there is no data-stat value to match it against. If I wanted to grab the player's name, for instance, I could just do:
# grab players name
rows = tableTeamBatting.find_all('tr')
for row in rows:
    players = []
    player = row.find('td', {'data-stat' : 'player'})
    if player:
        player = player.text
        print(player)
I couldn't get @F.Hoque's solution to output exactly what I wanted, so I made this monstrosity:
# grab players UID
rowsUID = tableTeamBatting.find_all('tr')
for rowUID in rowsUID:
    playerUID = rowUID.select('a[href]')
    playerUID = playerUID if playerUID else None
    if playerUID == None:
        continue
    else:
        pUID = str(playerUID)
        pUID = pUID.split('/')
        for p in range(len(pUID)):
            if '.shtml' in pUID[p]:
                stor = pUID[p].split('.shtml')
                print(stor[0])
This gives me the pUID that I am looking for. I could not use the code from the comment because it returned the whole cell:
<td class="left" csk="McCann,James" data-append-csv="mccanja02" data-stat="player"><strong><a href="/players/m/mccanja02.shtml">James McCann</a></strong></td>
<td class="left" csk="Alonso,Pete" data-append-csv="alonspe01" data-stat="player"><strong><a href="/players/a/alonspe01.shtml">Pete Alonso</a></strong></td>
<td class="left" csk="McNeil,Jeff" data-append-csv="mcneije01" data-stat="player"><strong><a href="/players/m/mcneije01.shtml">Jeff McNeil</a>*</strong></td>
<td class="left" csk="Lindor,Francisco" data-append-csv="lindofr01" data-stat="player"><strong><a href="/players/l/lindofr01.shtml">Francisco Lindor</a>#</strong></td>...
I was only looking for the data-append-csv value (the pUID). I appreciate the help though; I dug into some of the docs and was able to locate some stuff. I'm open to any suggestions on how to improve this.
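If the ID has to come from the link rather than from the attribute, the string splitting above can probably be shortened. This is a minimal sketch using only the standard library, assuming every player href has the same shape as in the sample ("/players/m/mccanja02.shtml"). The function name player_id_from_href is made up for illustration, and the href would come from something like a_tag['href'].

```python
from pathlib import PurePosixPath


def player_id_from_href(href):
    """Return the last path segment of href without its file extension.

    Hypothetical helper: assumes hrefs look like /players/m/mccanja02.shtml.
    """
    return PurePosixPath(href).stem


print(player_id_from_href("/players/m/mccanja02.shtml"))
```

This avoids converting the whole result list to a string and scanning its pieces for '.shtml'.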
Solution
mccanja02 is the value of the data-append-csv attribute, not part of the cell's text, so calling .text will not return it. You can select the cell with a CSS attribute selector and read the attribute with .get(), as follows:
html='''
<html>
<body>
<tbody>
<tr>
<th class="right" csk="1" data-stat="ranker" scope="row">
1
</th>
<td class="left" csk="1" data-stat="pos">
<strong>
C
</strong>
</td>
<td class="left" csk="McCann,James" data-append-csv="mccanja02" data-stat="player">
<strong>
<a href="/players/m/mccanja02.shtml">
James McCann
</a>
</strong>
</td>
<td class="right" data-stat="age">
32
</td>
<td class="right" data-stat="G">
13
</td>
<td class="right" data-stat="PA">
42
</td>
<td class="right" data-stat="AB">
36
</td>
<td class="right" data-stat="R">
5
</td>
<td class="right" data-stat="H">
7
</td>
<td class="right" data-stat="2B">
2
</td>
<td class="right iz" data-stat="3B">
0
</td>
<td class="right" data-stat="HR">
1
</td>
<td class="right" data-stat="RBI">
5
</td>
<td class="right" data-stat="SB">
1
</td>
<td class="right iz" data-stat="CS">
0
</td>
<td class="right" data-stat="BB">
2
</td>
<td class="right" data-stat="SO">
7
</td>
<td class="right" data-stat="batting_avg">
.194
</td>
<td class="right" data-stat="onbase_perc">
.286
</td>
<td class="right" data-stat="slugging_perc">
.333
</td>
<td class="right" data-stat="onbase_plus_slugging">
.619
</td>
<td class="right" data-stat="onbase_plus_slugging_plus">
87
</td>
<td class="right" data-stat="TB">
12
</td>
<td class="right" data-stat="GIDP">
1
</td>
<td class="right" data-stat="HBP">
3
</td>
<td class="right iz" data-stat="SH">
0
</td>
<td class="right" data-stat="SF">
1
</td>
<td class="right iz" data-stat="IBB">
0
</td>
</tr>
</tbody>
</body>
</html>
'''
from bs4 import BeautifulSoup

tableTeamBatting = BeautifulSoup(html, 'lxml')
# print(tableTeamBatting.prettify())
rowsUID = tableTeamBatting.select('tr')
for rowUID in rowsUID:
    # Match only the cell that carries a data-append-csv attribute
    playerUID = rowUID.select_one('td[data-append-csv]')
    # Read the attribute value; rows without such a cell give None
    playerUID = playerUID.get('data-append-csv') if playerUID else None
    print(playerUID)
Output:
mccanja02
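For a table with several rows, the same selector can be applied to the whole document so that header rows never produce None. The sketch below is only an illustration under stated assumptions: the sample HTML is a trimmed-down stand-in for the real table, BASE_URL is a placeholder (the real site is not named here), and it uses the built-in "html.parser" so lxml is not required.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Placeholder base URL: an assumption, replace with the site being scraped.
BASE_URL = "https://example.com"

# Trimmed-down sample with a header row and two player rows.
sample_html = """
<table><tbody>
<tr><th data-stat="ranker">Rk</th><th data-stat="player">Name</th></tr>
<tr><td data-append-csv="mccanja02" data-stat="player"><a href="/players/m/mccanja02.shtml">James McCann</a></td></tr>
<tr><td data-append-csv="alonspe01" data-stat="player"><a href="/players/a/alonspe01.shtml">Pete Alonso</a></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Select only cells that have the attribute, so no None check is needed.
cells = soup.select("td[data-append-csv]")

# The attribute value is the player ID.
player_ids = [cell["data-append-csv"] for cell in cells]

# Join the relative link inside each cell with the base URL.
player_urls = [urljoin(BASE_URL, cell.a["href"]) for cell in cells]

for player_id, url in zip(player_ids, player_urls):
    print(player_id, url)
```

Selecting the cells directly rather than looping over every tr keeps the header row out of the results.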
Answered By - F.Hoque