Issue
I have this data frame:
import pandas as pd
url = "https://www-genesis.destatis.de/genesisWS/rest/2020/data/tablefile?username=DEB924AL95&password=P@ssword123&name=45213-0005&area=all&compress=false&transpose=false&startyear=1900&endyear=×lices=®ionalvariable=®ionalkey=&classifyingvariable1=WERTE4&classifyingkey1=REAL&classifyingvariable2=WERT03&classifyingkey2=BV4KSB&classifyingvariable3=WZ2008&classifyingkey3=WZ08-551&format=xlsx&job=false&stand=01.01.1970&language=en"
df = pd.read_excel(url)
#df.head(20)
df = df.iloc[7:-5]
df
Turnover from accommodation and food services: Germany,\nmonths, price types, original and adjusted data, economic\nactivities Unnamed: 1 Unnamed: 2
WZ08-55 Accommodation NaN NaN
1994 January 121.9
NaN February 122.0
March 121.4
April 122.1
May 120.1
June 123.1
July 125.5
August 126.1
September 127.8
October 124.3
November 121.8
December 121.7
1995 January 120.9
NaN February 121.5
March 120.8
The expected outcome should be like this. WZ08-55 Accomodation
is the name of an industry. There are many such industry names in that column. These Industry names
ought to be Column Headers
. And the Years which begins in the rows next to the Industry names need to be clubbed with Months and form DATE column.
The Industry names are followed by the Year in the next row and rest of the rows are blank. I have no idea how to proceed.
Solution
Clean your data first then reshape your dataframe:
# Clean column names
df.columns = ['Variable', 'Date', 'Value']
# Boolean mask
m = df['Date'].isna()
# Clean data
df['Date'] += '-' + df['Variable'].ffill()
df['Variable'] = df['Variable'].where(m).ffill()
# Reshape your dataframe
out = (df[~m].replace('...', np.nan)
.pivot_table(index='Date', columns='Variable',
values='Value', sort=False)
.reset_index().rename_axis(columns=None))
Output:
>>> out
Date WZ08-55 Accommodation ... WZ08-561-01 Restaurants WZ08-55-01 Accommodation and food and beverage service act.
0 January-1994 121.9 ... 193.2 152.5
1 February-1994 122.0 ... 189.0 150.7
2 March-1994 121.4 ... 192.2 152.2
3 April-1994 122.1 ... 189.0 150.6
4 May-1994 120.1 ... 189.9 150.1
.. ... ... ... ... ...
351 April-2023 98.6 ... 90.0 91.8
352 May-2023 90.9 ... 85.4 87.3
353 June-2023 88.2 ... 84.7 87.3
354 July-2023 84.4 ... 84.3 84.7
355 August-2023 82.3 ... 83.2 83.2
[356 rows x 12 columns]
Edit: To get the expected date format, you can replace:
df['Date'] = (pd.to_datetime(df['Date'] + '-' + df['Variable'].ffill(), format='%B-%Y')
.dt.strftime('%b-%y'))
With:
df['Date'] = pd.to_datetime(df['Date'], format='%B-%Y').dt.strftime('%b-%y')
Answered By - Corralien
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.