Issue
I'm trying to get the transfer history of the top 500 most valuable players on Transfermarkt. I've managed (with some help) to loop through each player's profile and scrape the image and name. Now I want the transfer history, which can be found in a table on each player's profile page.
I want to save that table in a DataFrame using pandas and then write it to a CSV, with Season, Date, etc. as headers. For the club columns, e.g. Monaco and PSG, I just want the club names, not the pictures or the nationality. But right now, all I get is this:
Empty DataFrame
Columns: []
Index: []
Expected output:
  Season         Date    Left  Joined       MV      Fee
0  18/19  Jul 1, 2018  Monaco     PSG  120.00m  145.00m
I've viewed the source and inspected the page, but can't find anything that helps me apart from the tbody and tr elements. And with the way I'm doing it, I need to pinpoint that one table, since there are several others on the page (there's a note on this after the code below).
This is my code:
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

result = []

def main(url):
    with requests.Session() as req:
        result = []
        for item in range(1, 21):
            print(f"Collecting Links From Page# {item}")
            r = req.get(url.format(item), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            tr = soup.find_all("tbody")[1].find_all("tr", recursive=False)
            result.extend([
                {
                    "Season": t[1].text.strip()
                }
                for t in (t.find_all(recursive=False) for t in tr)
            ])

df = pd.DataFrame(result)
print(df)
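A note on pinpointing that table: pd.read_html accepts a match argument that keeps only the tables containing the given text, which avoids counting tbody elements by hand. A minimal sketch, assuming the transfer-history table is the only one on the page whose text contains "Season" (the profile URL below is just an example):

import pandas as pd
import requests

headers = {'User-Agent': 'Mozilla/5.0'}

# Example profile URL; any player profile with a transfer table would do.
r = requests.get("https://www.transfermarkt.com/kylian-mbappe/profil/spieler/342229",
                 headers=headers)

# match= discards every parsed table that does not contain "Season",
# so index 0 is the transfer-history table rather than some sidebar table.
tables = pd.read_html(r.content, match="Season")
print(tables[0].head())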
Solution
import requests
from bs4 import BeautifulSoup
import pandas as pd

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

def main(url):
    # Walk the 20 list pages and collect each player's profile link and name.
    with requests.Session() as req:
        links = []
        names = []
        for item in range(1, 21):
            print(f"Collecting Links From Page# {item}")
            r = req.get(url.format(item), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            # url[:29] is "https://www.transfermarkt.com", the site root
            urls = [f"{url[:29]}{item.get('href')}" for item in soup.findAll(
                "a", class_="spielprofil_tooltip")]
            ns = [item.text for item in soup.findAll(
                "a", class_="spielprofil_tooltip")][:-5]
            links.extend(urls)
            names.extend(ns)
    return links, names

def parser():
    links, names = main(site)
    for link, name in zip(links, names):
        with requests.Session() as req:
            r = req.get(link, headers=headers)
            # read_html parses every table on the profile page; index 1 is
            # the transfer-history table, with club names as plain text.
            df = pd.read_html(r.content)[1]
            # Prepend a row holding the player's name, then restore the index.
            df.loc[-1] = name
            df.index = df.index + 1
            df.sort_index(inplace=True)
            print(df)

parser()
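To turn this into the CSV the question asks for, one option is to keep only the plain-text columns and append each player's rows to a single file. A minimal sketch, untested, reusing the headers dict from above; it assumes the parsed table ends up with columns named Season, Date, Left, Joined, MV and Fee (pandas may flatten the real header differently, so check df.columns first):

wanted = ["Season", "Date", "Left", "Joined", "MV", "Fee"]

def save_transfers(link, name, first=False, path="transfers.csv"):
    r = requests.get(link, headers=headers)
    # Select the transfer-history table by its "Season" header text.
    df = pd.read_html(r.content, match="Season")[0]
    # Keep only the plain-text columns, skipping any that are missing.
    df = df[[c for c in wanted if c in df.columns]]
    df.insert(0, "Player", name)  # tag every row with the player's name
    # Write the header once for the first player, then append the rest.
    df.to_csv(path, mode="w" if first else "a", header=first, index=False)

Called as save_transfers(link, name, first=(i == 0)) inside the loop in parser() (using enumerate over zip(links, names)), this yields one transfers.csv with Season, Date, etc. as headers.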
Answered By - αԋɱҽԃ αмєяιcαη