Issue
I am web scraping the sofifa website into a workable CSV. Each player gets a row. My main problem is that the position section of the website only exports the first position whenever I try to iterate through it. Ideally, I would like all of the positions to be in the same column, separated by commas.
Here is the source HTML (picture: Sofifa Website):
<tr>
<td class="col-avatar"><figure class="avatar">
<img alt="" data-src="https://cdn.sofifa.com/players/240/950/21_60.png" data-srcset="https://cdn.sofifa.com/players/240/950/21_120.png 2x, https://cdn.sofifa.com/players/240/950/21_180.png 3x" src="https://cdn.sofifa.com/players/240/950/21_60.png" data-root="https://cdn.sofifa.com/players/" data-type="player" id="240950" class="player-check loaded" srcset="https://cdn.sofifa.com/players/240/950/21_120.png 2x, https://cdn.sofifa.com/players/240/950/21_180.png 3x" data-was-processed="true"></figure></td>
<td class="col-name">
<a class="tooltip" href="/player/240950/pedro-antonio-pereira-goncalves/210058/" data-tooltip="Pedro António Pereira Gonçalves"><div class="bp3-text-overflow-ellipsis"><img title="Portugal" alt="" src="https://cdn.sofifa.com/flags/pt.png" data-src="https://cdn.sofifa.com/flags/pt.png" data-srcset="https://cdn.sofifa.com/flags/pt@2x.png 2x, https://cdn.sofifa.com/flags/pt@3x.png 3x" class="flag loaded" srcset="https://cdn.sofifa.com/flags/pt@2x.png 2x, https://cdn.sofifa.com/flags/pt@3x.png 3x" data-was-processed="true"> Pedro Gonçalves</div></a><a rel="nofollow" href="/players?pn=23"><span class="pos pos23">RW</span></a> <a rel="nofollow" href="/players?pn=14"><span class="pos pos14">CM</span></a></td><td class="col col-ae" data-col="ae">22</td><td class="col col-oa" data-col="oa"><span class="bp3-tag p p-79">79</span></td><td class="col col-pt" data-col="pt"><span class="bp3-tag p p-87">87</span></td><td class="col-name">
<div class="bp3-text-overflow-ellipsis"><figure class="avatar avatar-sm transparent">
<img alt="" class="team loaded" data-src="https://cdn.sofifa.com/teams/237/30.png" data-srcset="https://cdn.sofifa.com/teams/237/60.png 2x, https://cdn.sofifa.com/teams/237/90.png 3x" src="https://cdn.sofifa.com/teams/237/30.png" data-root="https://cdn.sofifa.com/teams/" data-type="team" srcset="https://cdn.sofifa.com/teams/237/60.png 2x, https://cdn.sofifa.com/teams/237/90.png 3x" data-was-processed="true">
</figure>
<a href="/team/237/sporting-cp/">Sporting CP</a><div class="sub">
2020 ~ 2025</div>
</div>
</td><td class="col col-vl" data-col="vl">€39.5M</td><td class="col col-wg" data-col="wg">€16K</td><td class="col col-tt" data-col="tt"><span class="bp3-tag p">2021</span></td><td class="col-comment">
5.2K</td>
</tr>
This is my web scraping code:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

# Get basic players information for all players
base_url = "https://sofifa.com/players?offset="
columns = ['ID', 'Name', 'Age', 'Positions','Nationality', 'Overall', 'Potential', 'Club', 'Value', 'Wage',]
data = pd.DataFrame(columns = columns)
for offset in range(0, 335):
    url = base_url + str(offset * 60)
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    table_body = soup.find('tbody')
    for row in table_body.findAll('tr'):
        td = row.findAll('td')
        pid = td[0].find('img').get('id')
        nationality = td[1].find('img').get('title')
        name = td[1].find("a").get("data-tooltip")
        rel = td[1].findAll('a',{'rel': 'nofollow'})
        pos= rel[0].findAll('span')
        for span in pos :
            positions= (span.text.split)
        age = td[2].text
        overall = td[3].text.strip()
        potential = td[4].text.strip( )
        club = td[5].find('a').text
        value = td[6].text.strip()
        wage = td[7].text.strip()
        player_data = pd.DataFrame([[pid, name, age, positions, nationality, overall, potential, club, value, wage]])
        player_data.columns = columns
        data = data.append(player_data, ignore_index=True)
    print("done for "+str(offset),end="\r")
data.drop_duplicates()
data.head()
data.to_csv('player data.csv', encoding='utf-8-sig')
It yields this output (picture: Excel Output).
Solution
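Before the fix, a minimal sketch of why the loop in the question appears to keep only one position. Nested lists stand in for the parsed row (hypothetical data mirroring the two `<a rel="nofollow">` links in the HTML above), so this is an illustration of the pattern, not the scraper itself. Two things seem to go wrong: `rel[0]` looks inside the first link only, and `span.text.split` is written without parentheses, so it refers to the method instead of calling it.

```python
# Stand-in for one parsed row: each inner list holds the span texts found
# inside one <a rel="nofollow"> link (assumed data, taken from the HTML above).
anchors = [["RW"], ["CM"]]

# Pattern from the question: only the first link is inspected,
# so the second position is never seen.
first_only = [text for text in anchors[0]]

# Writing .split without parentheses yields a bound method, not a string.
not_called = "RW".split

# Collecting the text from every link, then joining once, keeps all positions.
all_positions = [text for spans in anchors for text in spans]
joined = ", ".join(all_positions)

print(first_only)            # ['RW']
print(callable(not_called))  # True
print(joined)                # RW, CM
```

The same collect-then-join idea carries over to the real page by selecting every position span in the row rather than indexing into the first link.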
To get the positions as a single comma-separated string, you can try:
import pandas as pd
import requests
from bs4 import BeautifulSoup


def get_data(offset):
    url = "https://sofifa.com/players?offset=" + str(offset * 60)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    rv = []
    for row in soup.select("tbody tr"):
        id_ = row.select_one("img[id]")["id"]
        name = row.select_one(".col-name [data-tooltip]")["data-tooltip"]
        age = row.select_one(".col-ae").get_text(strip=True)
        # collect the text of every position span in the row
        positions = [p.get_text(strip=True) for p in row.select("span.pos")]
        nationality = row.select_one("img.flag")["title"]
        overall = row.select_one(".col-oa").get_text(strip=True)
        potential = row.select_one(".col-pt").get_text(strip=True)
        club = row.select_one(".col-name > div > a").get_text(strip=True)
        # sometimes there isn't any club, just country:
        if club == "":
            club = row.select_one(".col-name > div > a")["title"]
        value = row.select_one(".col-vl").get_text(strip=True)
        wage = row.select_one(".col-wg").get_text(strip=True)

        rv.append(
            [
                id_,
                name,
                age,
                ", ".join(positions),
                nationality,
                overall,
                potential,
                club,
                value,
                wage,
            ]
        )

    return rv


all_data = []
for offset in range(0, 3):  # <--- increase offset here
    print("Offset {}...".format(offset))
    all_data.extend(get_data(offset))

df = pd.DataFrame(
    all_data,
    columns=[
        "ID",
        "Name",
        "Age",
        "Positions",
        "Nationality",
        "Overall",
        "Potential",
        "Club",
        "Value",
        "Wage",
    ],
)

print(df)
df.to_csv("data.csv", index=False)
Prints:
...
141 241637 Aurélien Tchouaméni 20 CM, CDM France 77 85 AS Monaco €23M €35K
142 258315 Bright Akwo Arrey-Mbi 17 CB, LB Germany 62 85 Bayern München II €1.2M €500
143 245367 Xavi Simons 17 CM Netherlands 65 84 Paris Saint-Germain €1.8M €2K
144 207865 Marcos Aoás Corrêa 26 CB, CDM Brazil 87 90 Paris Saint-Germain €92.5M €135K
145 241852 Moussa Diaby 20 LW, LM France 81 88 Bayer 04 Leverkusen €51M €60K
146 188567 Pierre-Emerick Aubameyang 31 ST, LW Gabon 85 85 Arsenal €45.5M €145K
...
and saves data.csv (screenshot from LibreOffice).
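A value such as "CM, CDM" contains the CSV delimiter, so it only stays in one column because the writer quotes it. The sketch below shows this with Python's standard `csv` module and two rows from the output above (trimmed to three columns for brevity, which is an assumption of this example). pandas' `to_csv` also uses minimal quoting by default, so the saved file should behave the same way.

```python
import csv
import io

# Two players from the printed output above, trimmed to three columns.
rows = [
    ["ID", "Name", "Positions"],
    ["241637", "Aurélien Tchouaméni", "CM, CDM"],
    ["245367", "Xavi Simons", "CM"],
]

# Write to an in-memory buffer; fields containing a comma are quoted.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields three fields per row, not four.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[1])
```

If a spreadsheet program still splits the positions into separate columns, its import dialog is probably not treating the double quote as the text delimiter.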
Answered By - Andrej Kesely