Issue
I'm trying to get the transfer history of the top 500 most valuable players on Transfermarkt. I've managed (with some help) to loop through each player's profile and scrape the image and name. Now I want the transfer history, which can be found in a table on each player's profile page.
I want to save that table in a DataFrame using pandas and then write it to a CSV, with Season, Date, etc. as headers. For the club columns, e.g. Monaco and PSG, I just want the club names, not the pictures or the nationality. But right now, all I get is this:
Empty DataFrame
Columns: []
Index: []
Expected output:
  Season         Date    Left  Joined       MV      Fee
0  18/19  Jul 1, 2018  Monaco     PSG  120.00m  145.00m
I've viewed the source and inspected the page, but can't find anything that helps me apart from the tbody and tr elements. And with the way I'm doing it, I need to pinpoint that one table, since there are several others on the page (there's a note on this after the code below).
This is my code:
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

result = []

def main(url):
    with requests.Session() as req:
        result = []
        for item in range(1, 21):
            print(f"Collecting Links From Page# {item}")
            r = req.get(url.format(item), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            tr = soup.find_all("tbody")[1].find_all("tr", recursive=False)
            result.extend([
                {
                    "Season": t[1].text.strip()
                }
                for t in (t.find_all(recursive=False) for t in tr)
            ])

df = pd.DataFrame(result)
print(df)
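A note on pinpointing that table: pd.read_html accepts a match argument that keeps only the tables containing the given text, which avoids counting tbody elements by hand. A minimal sketch, assuming the transfer-history table is the only one on the page whose text contains "Season" (the profile URL below is just an example):

import pandas as pd
import requests

headers = {'User-Agent': 'Mozilla/5.0'}

# Example profile URL; any player profile with a transfer table would do.
r = requests.get("https://www.transfermarkt.com/kylian-mbappe/profil/spieler/342229",
                 headers=headers)

# match= discards every parsed table that does not contain "Season",
# so index 0 is the transfer-history table rather than some sidebar table.
tables = pd.read_html(r.content, match="Season")
print(tables[0].head())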
Solution
import requests
from bs4 import BeautifulSoup
import pandas as pd

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

def main(url):
    # Walk the 20 list pages and collect each player's profile link and name.
    with requests.Session() as req:
        links = []
        names = []
        for item in range(1, 21):
            print(f"Collecting Links From Page# {item}")
            r = req.get(url.format(item), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            # url[:29] is "https://www.transfermarkt.com", the site root
            urls = [f"{url[:29]}{item.get('href')}" for item in soup.findAll(
                "a", class_="spielprofil_tooltip")]
            ns = [item.text for item in soup.findAll(
                "a", class_="spielprofil_tooltip")][:-5]
            links.extend(urls)
            names.extend(ns)
    return links, names

def parser():
    links, names = main(site)
    for link, name in zip(links, names):
        with requests.Session() as req:
            r = req.get(link, headers=headers)
            # read_html parses every table on the profile page; index 1 is
            # the transfer-history table, with club names as plain text.
            df = pd.read_html(r.content)[1]
            # Prepend a row holding the player's name, then restore the index.
            df.loc[-1] = name
            df.index = df.index + 1
            df.sort_index(inplace=True)
            print(df)

parser()
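To turn this into the CSV the question asks for, one option is to keep only the plain-text columns and append each player's rows to a single file. A minimal sketch, untested, reusing the headers dict from above; it assumes the parsed table ends up with columns named Season, Date, Left, Joined, MV and Fee (pandas may flatten the real header differently, so check df.columns first):

wanted = ["Season", "Date", "Left", "Joined", "MV", "Fee"]

def save_transfers(link, name, first=False, path="transfers.csv"):
    r = requests.get(link, headers=headers)
    # Select the transfer-history table by its "Season" header text.
    df = pd.read_html(r.content, match="Season")[0]
    # Keep only the plain-text columns, skipping any that are missing.
    df = df[[c for c in wanted if c in df.columns]]
    df.insert(0, "Player", name)  # tag every row with the player's name
    # Write the header once for the first player, then append the rest.
    df.to_csv(path, mode="w" if first else "a", header=first, index=False)

Called as save_transfers(link, name, first=(i == 0)) inside the loop in parser() (using enumerate over zip(links, names)), this yields one transfers.csv with Season, Date, etc. as headers.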
Answered By - αԋɱҽԃ αмєяιcαη