Wednesday, October 20, 2021

[FIXED] Fill in missing date values within a dataframe


Issue

I have a dataframe, df, where I would like to fill in missing date values, as well as missing ids. I also have a few id values that I would like to add into the dataframe, such as 'dd', 'ee' and so on.

Data

id  date    pwr
aa  Q1.22   10
aa  Q1.22   1
aa  Q2.22   1
aa  Q2.22   5
bb  Q1.22   5
bb  Q1.22   1
bb  Q2.22   1
bb  Q2.22   1
cc  Q1.22   2
cc  Q2.22   2

Desired

id  date    pwr
aa  Q1.22   10
aa  Q1.22   1
aa  Q2.22   1
aa  Q2.22   5
aa  Q3.22   
aa  Q4.22   
bb  Q1.22   5
bb  Q1.22   1
bb  Q2.22   1
bb  Q2.22   1
bb  Q3.22   
bb  Q4.22   
cc  Q1.22   2
cc  Q2.22   2
cc  Q3.22   
cc  Q4.22   
dd  Q1.22   
dd  Q2.22   
dd  Q3.22   
dd  Q4.22

Doing

I believe I have to establish a range, but I am not sure how to build one when the dates are quarters. I am still researching. Any suggestion is appreciated.

r = pd.date_range(start=df.dt.min(), end=df.dt.max())
df.set_index('dt').reindex(r).fillna(0.0).rename_axis('dt').reset_index()

Solution

Convert the date to a period date type:

pat = r"(?P<Q>.)(?P<quarter>\d+)\.(?P<year>.+)"
repl = lambda m: f"{m.group('quarter')}{m.group('Q')}20{m.group('year')}"
df = df.assign(date = df.date
                       .str.replace(pat, repl, regex=True)
                       .transform(pd.Period)
               )

df
 
   id    date  pwr
0  aa  2022Q1   10
1  aa  2022Q1    1
2  aa  2022Q2    1
3  aa  2022Q2    5
4  bb  2022Q1    5
5  bb  2022Q1    1
6  bb  2022Q2    1
7  bb  2022Q2    1
8  cc  2022Q1    2
9  cc  2022Q2    2

df.dtypes
 
id             object
date    period[Q-DEC]
pwr             int64
dtype: object
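For a self-contained run, the sample frame can be built from the question's data and converted without a regex. This is only a sketch: it assumes every label has the exact shape Q<n>.<yy> and that all years fall in the 2000s, and the variable name labels is illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": ["aa", "aa", "aa", "aa", "bb", "bb", "bb", "bb", "cc", "cc"],
    "date": ["Q1.22", "Q1.22", "Q2.22", "Q2.22", "Q1.22",
             "Q1.22", "Q2.22", "Q2.22", "Q1.22", "Q2.22"],
    "pwr": [10, 1, 1, 5, 5, 1, 1, 1, 2, 2],
})

# Rearrange "Q1.22" into "2022Q1" by slicing (assumes a fixed-width label).
labels = ["20" + text[3:] + text[:2] for text in df["date"]]

# Parse the rearranged strings as quarterly periods.
df["date"] = pd.PeriodIndex(labels, freq="Q")
```

Slicing is shorter than the regex but less forgiving: a label such as Q10.22 or Q1.2022 would be parsed incorrectly, so the regex version is the safer choice for messy input.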

A convenient method to expose the missing values would be to use the complete function from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor  # importing janitor registers the complete method on DataFrames

# create the period ranges within a dictionary
new_values = {"date" : lambda date: pd.period_range(date.min(), 
                                          periods = 4)}

new_values = [new_values]

df.complete(new_values, by='id')
 
    id    date   pwr
0   aa  2022Q1  10.0
1   aa  2022Q1   1.0
2   aa  2022Q2   1.0
3   aa  2022Q2   5.0
4   aa  2022Q3   NaN
5   aa  2022Q4   NaN
6   bb  2022Q1   5.0
7   bb  2022Q1   1.0
8   bb  2022Q2   1.0
9   bb  2022Q2   1.0
10  bb  2022Q3   NaN
11  bb  2022Q4   NaN
12  cc  2022Q1   2.0
13  cc  2022Q2   2.0
14  cc  2022Q3   NaN
15  cc  2022Q4   NaN

Sticking to pandas only, let's get the unique rows from df:

temp = df.drop_duplicates(['id', 'date'])

We need unique (id, date) rows so that we can build a period range per id and reindex against it (reindex raises an error on a non-unique index):

temp = (temp
        .set_index('date')
        .groupby('id')
        .apply(lambda df: df.reindex(pd.period_range(df.index.min(), 
                                                     periods = 4)))
        .drop(columns=['id', 'pwr'])
        )
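The need for drop_duplicates can be checked on a small series. This is a minimal sketch with made-up values; the names s, target, unique and filled are illustrative only.

```python
import pandas as pd

# A quarterly index that contains 2022Q1 twice.
s = pd.Series(
    [10, 1, 1],
    index=pd.PeriodIndex(["2022Q1", "2022Q1", "2022Q2"], freq="Q"),
)
target = pd.period_range("2022Q1", periods=4, freq="Q")

# Reindexing against the duplicated index fails with a ValueError.
try:
    s.reindex(target)
    raised = False
except ValueError:
    raised = True

# After keeping the first row per period, the same reindex succeeds
# and the two missing quarters show up as NaN.
unique = s[~s.index.duplicated()]
filled = unique.reindex(target)
```

This is why the duplicates are dropped first and the full rows are recovered later through a merge.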

Let's join temp back to df:

df.merge(temp, 
         left_on=['id', 'date'], 
         right_index = True, 
         how = 'right')

   id    date   pwr
0  aa  2022Q1  10.0
1  aa  2022Q1   1.0
2  aa  2022Q2   1.0
3  aa  2022Q2   5.0
9  aa  2022Q3   NaN
9  aa  2022Q4   NaN
4  bb  2022Q1   5.0
5  bb  2022Q1   1.0
6  bb  2022Q2   1.0
7  bb  2022Q2   1.0
9  bb  2022Q3   NaN
9  bb  2022Q4   NaN
8  cc  2022Q1   2.0
9  cc  2022Q2   2.0
9  cc  2022Q3   NaN
9  cc  2022Q4   NaN
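The question also asks for ids that are absent from the data, such as 'dd', and neither approach above adds them. One possible way, sketched here and not part of the original answer, is to build the full id by quarter grid and left-merge the data onto it. The sketch assumes every id should cover the same four quarters starting at the earliest date in df; extra_ids, all_ids, quarters, grid and result are illustrative names.

```python
import pandas as pd

# Sample data from the question, with date already stored as quarterly periods.
df = pd.DataFrame({
    "id": ["aa", "aa", "aa", "aa", "bb", "bb", "bb", "bb", "cc", "cc"],
    "date": pd.PeriodIndex(
        ["2022Q1", "2022Q1", "2022Q2", "2022Q2", "2022Q1",
         "2022Q1", "2022Q2", "2022Q2", "2022Q1", "2022Q2"],
        freq="Q",
    ),
    "pwr": [10, 1, 1, 5, 5, 1, 1, 1, 2, 2],
})

extra_ids = ["dd"]  # assumption: the ids to add that are not in df
all_ids = sorted(set(df["id"]) | set(extra_ids))

# Assumption: four quarters per id, starting at the earliest quarter in df.
quarters = pd.period_range(df["date"].min(), periods=4)

# Every (id, quarter) combination as a two-column frame.
grid = pd.MultiIndex.from_product(
    [all_ids, quarters], names=["id", "date"]
).to_frame(index=False)

# A left merge keeps every grid row; duplicated (id, date) rows in df
# expand naturally, and combinations without data get NaN in pwr.
result = grid.merge(df, on=["id", "date"], how="left")
```

Because the grid is built up front, no groupby or reindex is needed, and the duplicate rows in df do not have to be dropped first.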

Answered By - sammywemmy

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
