Issue
I am trying to append scraped data to a dataframe:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import requests
import csv
url="https://en.wikipedia.org/wiki/List_of_German_football_champions"
page=requests.get(url).content
soup=BeautifulSoup(page,"html.parser")
seasons=[]
first_places=[]
runner_ups=[]
third_places=[]
top_scorrers=[]
tbody=soup.find_all("tbody")[7]
trs=tbody.find_all("tr")
for tr in trs:
    season = tr.find_all("a")[0].text
    first_place = tr.find_all("a")[1].text
    runner_up = tr.find_all("a")[2].text
    third_place = tr.find_all("a")[3].text
    top_scorer = tr.find_all("a")[4].text
    seasons.append(season)
    first_places.append(first_place)
    runner_ups.append(runner_up)
    third_places.append(third_place)
    top_scorrers.append(top_scorer)
tuples=list(zip(seasons,first_places,runner_ups,third_places,top_scorrers))
df=pd.DataFrame(tuples,columns=["Season","FirstPlace","RunnerUp","ThirdPlace","TopScorrer"])
df
Is there an easier way to append data directly to an empty dataframe without creating lists and then zipping them?
Solution
If you are already using pandas, the simplest way to create your DataFrame is pandas.read_html():
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_German_football_champions')[7]
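Selecting the table by position with [7] breaks as soon as the page gains or loses a table. As a hedged sketch, read_html also accepts a match argument that keeps only tables whose text matches a string or regex. The HTML below is an invented stand-in for the real page, so it runs offline; read_html still needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Invented miniature page with two tables (not the real Wikipedia markup).
html = """
<table>
  <tr><th>Club</th><th>City</th></tr>
  <tr><td>Example FC</td><td>Example City</td></tr>
</table>
<table>
  <tr><th>Season</th><th>Champions</th></tr>
  <tr><td>1963–64</td><td>1. FC Köln</td></tr>
  <tr><td>1964–65</td><td>Werder Bremen</td></tr>
</table>
"""

# match keeps only tables containing text that matches the pattern,
# so the result does not depend on the table's position in the page.
tables = pd.read_html(StringIO(html), match="Season")
df_match = tables[0]
print(df_match)
```

On the live page the pattern would need to be specific enough to single out the Bundesliga table, which is left as an assumption here.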
To rename the columns and get rid of the [7] footnote marker in the header:
df.columns = ['Season', 'Champions', 'Runners-up', 'Third place',
              'Top scorer(s)', 'Goals']
Output:
 | Season | Champions | Runners-up | Third place | Top scorer(s) | Goals |
---|---|---|---|---|---|---|
0 | 1963–64 | 1. FC Köln (2) | Meidericher SV | Eintracht Frankfurt | Uwe Seeler | 30 |
1 | 1964–65 | Werder Bremen (1) | 1. FC Köln | Borussia Dortmund | Rudi Brunnenmeier | 24 |
2 | 1965–66 | TSV 1860 Munich (1) | Borussia Dortmund | Bayern Munich | Friedhelm Konietzka | 26 |
3 | 1966–67 | Eintracht Braunschweig (1) | TSV 1860 Munich | Borussia Dortmund | Lothar Emmerich, Gerd Müller | 28 |
4 | 1967–68 | 1. FC Nürnberg (9) | Werder Bremen | Borussia Mönchengladbach | Hannes Löhr | 27 |
...
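Assigning a hard-coded list to df.columns works only while the number and order of columns stay the same. A minimal alternative sketch strips bracketed footnote markers from whatever headers come back. It assumes the [7] is a footnote marker appended to a header, and the one-row frame below is a made-up stand-in for the scraped table.

```python
import pandas as pd

# Hypothetical frame whose header still carries a footnote marker.
df_raw = pd.DataFrame(
    [["1963–64", "1. FC Köln (2)", "Uwe Seeler", 30]],
    columns=["Season", "Champions", "Top scorer(s)[7]", "Goals"],
)

# Remove any bracketed numeric marker such as "[7]" from each column name.
df_clean = df_raw.copy()
df_clean.columns = df_clean.columns.str.replace(r"\[\d+\]", "", regex=True)
print(list(df_clean.columns))
```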
An alternative that avoids all these lists, keeps the process cleaner, and uses BeautifulSoup
directly is to build more structured data, a single list of dicts:
data = []
for tr in soup.select('table:nth-of-type(8) tr:not(:has(th))'):
    data.append({
        'season': tr.find_all("a")[0].text,
        'first_place': tr.find_all("a")[1].text,
        'runner_up': tr.find_all("a")[2].text,
        'third_place': tr.find_all("a")[3].text,
        'top_scorer': tr.find_all("a")[4].text,
    })
pd.DataFrame(data)
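The loop above calls find_all("a") five times per row and raises an IndexError if a row has fewer than five links. One possible variation, sketched on invented HTML rather than the real page, collects the links once per row and skips rows that cannot fill every column. The keys list and the skip rule are assumptions made for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented miniature table (not the real Wikipedia markup).
html = """
<table>
  <tr><th>Season</th><th>Champions</th><th>Runners-up</th>
      <th>Third place</th><th>Top scorer(s)</th></tr>
  <tr>
    <td><a>1963–64</a></td><td><a>1. FC Köln</a></td>
    <td><a>Meidericher SV</a></td><td><a>Eintracht Frankfurt</a></td>
    <td><a>Uwe Seeler</a></td>
  </tr>
  <tr><td><a>1964–65</a></td><td>no links here</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

keys = ["season", "first_place", "runner_up", "third_place", "top_scorer"]
rows = []
for tr in soup.select("tr:not(:has(th))"):
    # Collect the link texts once instead of searching the row five times.
    links = [a.get_text(strip=True) for a in tr.find_all("a")]
    if len(links) < len(keys):
        continue  # skip rows that cannot fill every column
    rows.append(dict(zip(keys, links)))  # zip ignores links beyond the fifth

df_rows = pd.DataFrame(rows)
print(df_rows)
```

Silently skipping short rows hides missing data, so logging or counting the skipped rows may be preferable on the real page.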
Example
import pandas as pd
from bs4 import BeautifulSoup
import requests
url="https://en.wikipedia.org/wiki/List_of_German_football_champions"
page=requests.get(url).content
soup=BeautifulSoup(page,"html.parser")
data = []
for tr in soup.select('table:nth-of-type(8) tr:not(:has(th))'):
    data.append({
        'season': tr.find_all("a")[0].text,
        'first_place': tr.find_all("a")[1].text,
        'runner_up': tr.find_all("a")[2].text,
        'third_place': tr.find_all("a")[3].text,
        'top_scorer': tr.find_all("a")[4].text,
    })
pd.DataFrame(data)
Answered By - HedgeHog