Monday, November 20, 2023

[FIXED] Pandas data frame transformations

November 20, 2023 data-wrangling, pandas, python No comments

Issue

I have this data frame:

import pandas as pd

url = "https://www-genesis.destatis.de/genesisWS/rest/2020/data/tablefile?username=DEB924AL95&password=P@ssword123&name=45213-0005&area=all&compress=false&transpose=false&startyear=1900&endyear=&timeslices=&regionalvariable=&regionalkey=&classifyingvariable1=WERTE4&classifyingkey1=REAL&classifyingvariable2=WERT03&classifyingkey2=BV4KSB&classifyingvariable3=WZ2008&classifyingkey3=WZ08-551&format=xlsx&job=false&stand=01.01.1970&language=en"


df = pd.read_excel(url)

#df.head(20)

df = df.iloc[7:-5]
df
                                        
Turnover from accommodation and food services: Germany,\nmonths, price types, original and adjusted data, economic\nactivities  Unnamed: 1  Unnamed: 2

WZ08-55 Accommodation       NaN     NaN  
                1994    January     121.9
                 NaN   February     122.0
                          March     121.4
                          April     122.1
                            May     120.1
                           June     123.1
                           July     125.5
                         August     126.1
                      September     127.8
                        October     124.3
                       November     121.8
                       December     121.7
               1995     January     120.9
                NaN    February     121.5
                          March     120.8

The expected outcome should be like this. WZ08-55 Accomodation is the name of an industry. There are many such industry names in that column. These Industry names ought to be Column Headers. And the Years which begins in the rows next to the Industry names need to be clubbed with Months and form DATE column.

The Industry names are followed by the Year in the next row and rest of the rows are blank. I have no idea how to proceed.

Solution

Clean your data first then reshape your dataframe:

# Clean column names
df.columns = ['Variable', 'Date', 'Value']

# Boolean mask
m = df['Date'].isna()

# Clean data
df['Date'] += '-' + df['Variable'].ffill()
df['Variable'] = df['Variable'].where(m).ffill()

# Reshape your dataframe
out = (df[~m].replace('...', np.nan)
             .pivot_table(index='Date', columns='Variable',
                          values='Value', sort=False)
             .reset_index().rename_axis(columns=None))

Output:

>>> out
              Date  WZ08-55 Accommodation  ...  WZ08-561-01 Restaurants  WZ08-55-01 Accommodation and food and beverage service act.
0     January-1994                  121.9  ...                    193.2                                              152.5          
1    February-1994                  122.0  ...                    189.0                                              150.7          
2       March-1994                  121.4  ...                    192.2                                              152.2          
3       April-1994                  122.1  ...                    189.0                                              150.6          
4         May-1994                  120.1  ...                    189.9                                              150.1          
..             ...                    ...  ...                      ...                                                ...          
351     April-2023                   98.6  ...                     90.0                                               91.8          
352       May-2023                   90.9  ...                     85.4                                               87.3          
353      June-2023                   88.2  ...                     84.7                                               87.3          
354      July-2023                   84.4  ...                     84.3                                               84.7          
355    August-2023                   82.3  ...                     83.2                                               83.2          

[356 rows x 12 columns]

Edit: To get the expected date format, you can replace:

df['Date'] = (pd.to_datetime(df['Date'] + '-' + df['Variable'].ffill(), format='%B-%Y')
                .dt.strftime('%b-%y'))

With:

df['Date'] = pd.to_datetime(df['Date'], format='%B-%Y').dt.strftime('%b-%y')

Answered By - Corralien

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, November 20, 2023

[FIXED] Pandas data frame transformations

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels