Issue
I am trying to retrieve football squad data from multiple Wikipedia pages and put it into a pandas DataFrame. One example of the source is this [link][1], but I want to do this for the pages covering 1930-2018.

The code that I will show used to work in Python 2 and I'm trying to adapt it to Python 3. The information on every page is spread across multiple tables with 7 columns, and all of the tables have the same format.

The code used to crash, but now it runs. The only problem is that it produces an empty .csv file.

Just to give more context, these are the specific changes I made (there is a quick sanity check after this list):
Python 2:
```python
path = os.path.join('.cache', hashlib.md5(url).hexdigest() + '.html')
```
Python 3:
```python
path = os.path.join('.cache', hashlib.sha256(url.encode('utf-8')).hexdigest() + '.html')
```
Python 2:
```python
with open(path, 'w') as fd:
```
Python 3:
```python
with open(path, 'wb') as fd:
```
Python 2:
```python
years = range(1930, 1939, 4) + range(1950, 2015, 4)
```
Python 3 (here I also extended the range so I could get the 2018 World Cup):
```python
years = list(range(1930, 1939, 4)) + list(range(1950, 2019, 4))
```
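A quick way to sanity-check both changes in isolation (just a throwaway snippet, not part of the script):

```python
import hashlib

# Python 3 hashes bytes, not str, so the url has to be encoded first;
# hashlib.sha256(url) with a plain str raises a TypeError.
url = "http://en.wikipedia.org/wiki/2018_FIFA_World_Cup_squads"
print(hashlib.sha256(url.encode('utf-8')).hexdigest()[:12])

# The two ranges cover 1930-1938 and 1950-2018 in four-year steps.
years = list(range(1930, 1939, 4)) + list(range(1950, 2019, 4))
print(years[0], years[-1], len(years))  # -> 1930 2018 21
```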
This is the whole chunk of code. If somebody can spot where the problem is and give a solution, I would be very thankful.
```python
import os
import hashlib
import requests
from bs4 import BeautifulSoup
import pandas as pd

if not os.path.exists('.cache'):
    os.makedirs('.cache')

ua = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/15612.1.29.41.4'

session = requests.Session()

def get(url):
    '''Return cached BeautifulSoup tree for url'''
    path = os.path.join('.cache', hashlib.sha256(url.encode('utf-8')).hexdigest() + '.html')
    if not os.path.exists(path):
        print(url)
        response = session.get(url, headers={'User-Agent': ua})
        with open(path, 'wb') as fd:
            fd.write(response.text.encode('utf-8'))
    return BeautifulSoup(open(path), 'html.parser')

def squads(url):
    result = []
    soup = get(url)
    year = url[29:33]
    for table in soup.find_all('table', 'sortable'):
        if "wikitable" not in table['class']:
            country = table.find_previous("span", "mw-headline").text
            for tr in table.find_all('tr')[1:]:
                cells = [td.text.strip() for td in tr.find_all('td')]
                cells += [country, td.a.get('title') if td.a else 'none', year]
                result.append(cells)
    return result

years = list(range(1930, 1939, 4)) + list(range(1950, 2019, 4))

result = []
for year in years:
    url = "http://en.wikipedia.org/wiki/" + str(year) + "_FIFA_World_Cup_squads"
    result += squads(url)

Final_result = pd.DataFrame(result)
Final_result.to_csv('/Users/home/Downloads/data.csv', index=False, encoding='iso-8859-1')
```
[1]: https://en.wikipedia.org/wiki/2018_FIFA_World_Cup_squads
Solution
As for why the original code writes an empty file: on the current Wikipedia pages the squad tables carry both the `sortable` and `wikitable` classes, so the `if "wikitable" not in table['class']` filter most likely rejects every table and `result` stays empty. (And if a table ever did pass the filter, `td.a` on the line after the list comprehension would raise a `NameError` in Python 3, because comprehension variables no longer leak into the enclosing scope.)

To get information about each team for the years 1930-2018, you can use the next example:
```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/{}_FIFA_World_Cup_squads"

dfs = []
for year in range(1930, 2019):
    print(year)
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")

    # the squad tables are the ones that contain a "Pos." header cell
    tables = soup.find_all(
        lambda tag: tag.name == "table"
        and tag.select_one('th:-soup-contains("Pos.")')
    )

    for table in tables:
        # remove hidden elements that would otherwise leak into the parsed cell text
        for tag in table.select('[style="display:none"]'):
            tag.extract()

        df = pd.read_html(str(table))[0]
        df["Year"] = year
        df["Country"] = table.find_previous(["h3", "h2"]).span.text
        dfs.append(df)

df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=False)
```
Prints:

```
...
13  14  FW  Moussa Konaté  3 April 1993 (aged 25)  28  Amiens  2018  Senegal  10.0
14  15  FW  Diafra Sakho  24 December 1989 (aged 28)  12  Rennes  2018  Senegal  3.0
15  16  GK  Khadim N'Diaye  5 April 1985 (aged 33)  26  Horoya  2018  Senegal  0.0
16  17  MF  Badou Ndiaye  27 October 1990 (aged 27)  20  Stoke City  2018  Senegal  1.0
17  18  FW  Ismaïla Sarr  25 February 1998 (aged 20)  16  Rennes  2018  Senegal  3.0
18  19  FW  M'Baye Niang  19 December 1994 (aged 23)  7  Torino  2018  Senegal  0.0
19  20  FW  Keita Baldé  8 March 1995 (aged 23)  19  Monaco  2018  Senegal  3.0
20  21  DF  Lamine Gassama  20 October 1989 (aged 28)  36  Alanyaspor  2018  Senegal  0.0
21  22  DF  Moussa Wagué  4 October 1998 (aged 19)  10  Eupen  2018  Senegal  0.0
22  23  GK  Alfred Gomis  5 September 1993 (aged 24)  1  SPAL  2018  Senegal  0.0
```
and saves data.csv.
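One optional refinement, not part of the original answer: the World Cup only took place in certain years, so the loop can be restricted to the same year list the question already builds, which skips the dozens of requests for pages that don't exist. A minimal sketch:

```python
# Tournament years only: 1930-1938 and 1950-2018 in four-year steps
# (no World Cups were held in 1942 and 1946, so those squad pages don't exist).
world_cup_years = list(range(1930, 1939, 4)) + list(range(1950, 2019, 4))

for year in world_cup_years:
    print(year)  # replace with the scraping loop body from the example above
```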
Answered By - Andrej Kesely