Issue
I am trying to retrieve football squad data from multiple Wikipedia pages and put it into a pandas DataFrame. One example of the source is this [link][1], but I want to do this for the pages covering 1930-2018.

The code that I will show used to work in Python 2 and I'm trying to adapt it to Python 3. The information on every page is spread across multiple tables with 7 columns, and all of the tables have the same format.

The code used to crash, but now it runs. The only problem is that it produces an empty .csv file.

Just to give more context, these are the specific changes I made (there is a quick sanity check after this list):
Python 2:
```python
path = os.path.join('.cache', hashlib.md5(url).hexdigest() + '.html')
```
Python 3:
```python
path = os.path.join('.cache', hashlib.sha256(url.encode('utf-8')).hexdigest() + '.html')
```
Python 2:
```python
with open(path, 'w') as fd:
```
Python 3:
```python
with open(path, 'wb') as fd:
```
Python 2:
```python
years = range(1930, 1939, 4) + range(1950, 2015, 4)
```
Python 3 (here I also extended the range so I could get the 2018 World Cup):
```python
years = list(range(1930, 1939, 4)) + list(range(1950, 2019, 4))
```
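A quick way to sanity-check both changes in isolation (just a throwaway snippet, not part of the script):

```python
import hashlib

# Python 3 hashes bytes, not str, so the url has to be encoded first;
# hashlib.sha256(url) with a plain str raises a TypeError.
url = "http://en.wikipedia.org/wiki/2018_FIFA_World_Cup_squads"
print(hashlib.sha256(url.encode('utf-8')).hexdigest()[:12])

# The two ranges cover 1930-1938 and 1950-2018 in four-year steps.
years = list(range(1930, 1939, 4)) + list(range(1950, 2019, 4))
print(years[0], years[-1], len(years))  # -> 1930 2018 21
```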
This is the whole chunk of code. If somebody can spot where the problem is and give a solution, I would be very thankful.
```python
import os
import hashlib
import requests
from bs4 import BeautifulSoup
import pandas as pd

if not os.path.exists('.cache'):
    os.makedirs('.cache')

ua = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/15612.1.29.41.4'

session = requests.Session()

def get(url):
    '''Return cached BeautifulSoup tree for url'''
    path = os.path.join('.cache', hashlib.sha256(url.encode('utf-8')).hexdigest() + '.html')
    if not os.path.exists(path):
        print(url)
        response = session.get(url, headers={'User-Agent': ua})
        with open(path, 'wb') as fd:
            fd.write(response.text.encode('utf-8'))
    return BeautifulSoup(open(path), 'html.parser')

def squads(url):
    result = []
    soup = get(url)
    year = url[29:33]
    for table in soup.find_all('table', 'sortable'):
        if "wikitable" not in table['class']:
            country = table.find_previous("span", "mw-headline").text
            for tr in table.find_all('tr')[1:]:
                cells = [td.text.strip() for td in tr.find_all('td')]
                cells += [country, td.a.get('title') if td.a else 'none', year]
                result.append(cells)
    return result

years = list(range(1930, 1939, 4)) + list(range(1950, 2019, 4))

result = []
for year in years:
    url = "http://en.wikipedia.org/wiki/" + str(year) + "_FIFA_World_Cup_squads"
    result += squads(url)

Final_result = pd.DataFrame(result)
Final_result.to_csv('/Users/home/Downloads/data.csv', index=False, encoding='iso-8859-1')
```
[1]: https://en.wikipedia.org/wiki/2018_FIFA_World_Cup_squads
Solution
As for why the original code writes an empty file: on the current Wikipedia pages the squad tables carry both the `sortable` and `wikitable` classes, so the `if "wikitable" not in table['class']` filter most likely rejects every table and `result` stays empty. (And if a table ever did pass the filter, `td.a` on the line after the list comprehension would raise a `NameError` in Python 3, because comprehension variables no longer leak into the enclosing scope.)

To get information about each team for the years 1930-2018, you can use the next example:
```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/{}_FIFA_World_Cup_squads"

dfs = []
for year in range(1930, 2019):
    print(year)
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")

    # the squad tables are the ones that contain a "Pos." header cell
    tables = soup.find_all(
        lambda tag: tag.name == "table"
        and tag.select_one('th:-soup-contains("Pos.")')
    )

    for table in tables:
        # remove hidden elements that would otherwise leak into the parsed cell text
        for tag in table.select('[style="display:none"]'):
            tag.extract()

        df = pd.read_html(str(table))[0]
        df["Year"] = year
        df["Country"] = table.find_previous(["h3", "h2"]).span.text
        dfs.append(df)

df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=False)
```
Prints:

```
...
13  14  FW  Moussa Konaté  3 April 1993 (aged 25)  28  Amiens  2018  Senegal  10.0
14  15  FW  Diafra Sakho  24 December 1989 (aged 28)  12  Rennes  2018  Senegal  3.0
15  16  GK  Khadim N'Diaye  5 April 1985 (aged 33)  26  Horoya  2018  Senegal  0.0
16  17  MF  Badou Ndiaye  27 October 1990 (aged 27)  20  Stoke City  2018  Senegal  1.0
17  18  FW  Ismaïla Sarr  25 February 1998 (aged 20)  16  Rennes  2018  Senegal  3.0
18  19  FW  M'Baye Niang  19 December 1994 (aged 23)  7  Torino  2018  Senegal  0.0
19  20  FW  Keita Baldé  8 March 1995 (aged 23)  19  Monaco  2018  Senegal  3.0
20  21  DF  Lamine Gassama  20 October 1989 (aged 28)  36  Alanyaspor  2018  Senegal  0.0
21  22  DF  Moussa Wagué  4 October 1998 (aged 19)  10  Eupen  2018  Senegal  0.0
22  23  GK  Alfred Gomis  5 September 1993 (aged 24)  1  SPAL  2018  Senegal  0.0
```
and saves data.csv.
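One optional refinement, not part of the original answer: the World Cup only took place in certain years, so the loop can be restricted to the same year list the question already builds, which skips the dozens of requests for pages that don't exist. A minimal sketch:

```python
# Tournament years only: 1930-1938 and 1950-2018 in four-year steps
# (no World Cups were held in 1942 and 1946, so those squad pages don't exist).
world_cup_years = list(range(1930, 1939, 4)) + list(range(1950, 2019, 4))

for year in world_cup_years:
    print(year)  # replace with the scraping loop body from the example above
```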
Answered By - Andrej Kesely