Issue
I have a dataframe, df, in which I would like to fill in the missing date values for each id. I would also like to add a few id values that are not in the dataframe yet, such as 'dd', 'ee' and so on.
Data
id date pwr
aa Q1.22 10
aa Q1.22 1
aa Q2.22 1
aa Q2.22 5
bb Q1.22 5
bb Q1.22 1
bb Q2.22 1
bb Q2.22 1
cc Q1.22 2
cc Q2.22 2
Desired
id date pwr
aa Q1.22 10
aa Q1.22 1
aa Q2.22 1
aa Q2.22 5
aa Q3.22
aa Q4.22
bb Q1.22 5
bb Q1.22 1
bb Q2.22 1
bb Q2.22 1
bb Q3.22
bb Q4.22
cc Q1.22 2
cc Q2.22 2
cc Q3.22
cc Q4.22
dd Q1.22
dd Q2.22
dd Q3.22
dd Q4.22
Doing
I believe I have to establish a range, but I am not sure how to build one when the dates are quarters. I am still researching. Any suggestion is appreciated.
r = pd.date_range(start=df.dt.min(), end=df.dt.max())
df.set_index('dt').reindex(r).fillna(0.0).rename_axis('dt').reset_index()
Solution
Convert the date column to a quarterly period dtype:
# rewrite e.g. "Q1.22" as "1Q2022", which pd.Period can parse
pat = r"(?P<Q>.)(?P<quarter>\d+)\.(?P<year>.+)"
repl = lambda m: f"{m.group('quarter')}{m.group('Q')}20{m.group('year')}"
df = df.assign(date = df.date
                        .str.replace(pat, repl, regex=True)
                        .transform(pd.Period)
              )
df
id date pwr
0 aa 2022Q1 10
1 aa 2022Q1 1
2 aa 2022Q2 1
3 aa 2022Q2 5
4 bb 2022Q1 5
5 bb 2022Q1 1
6 bb 2022Q2 1
7 bb 2022Q2 1
8 cc 2022Q1 2
9 cc 2022Q2 2
df.dtypes
id object
date period[Q-DEC]
pwr int64
dtype: object
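As a cross-check on the conversion above, here is a minimal sketch of an alternative that rewrites the strings with backreferences and builds the values through pd.PeriodIndex. It assumes every value has the form Q<quarter>.<two-digit year> and that all years fall in the 2000s; the sample series `dates` is made up for illustration.

```python
import pandas as pd

# Hypothetical sample in the same "Q<quarter>.<yy>" format as the question
dates = pd.Series(["Q1.22", "Q2.22", "Q4.21"])

# Rewrite "Q1.22" as "2022Q1" (assumption: all years are in the 2000s)
as_text = dates.str.replace(r"Q(\d)\.(\d{2})", r"20\2Q\1", regex=True)

# Parse the rewritten strings as quarterly periods
periods = pd.Series(pd.PeriodIndex(as_text, freq="Q"), index=dates.index)
```

This should give the same quarterly period dtype as the regex-plus-pd.Period version, without a Python-level callback per row.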
A convenient method to expose the missing values would be to use the complete function from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd

# create the period ranges within a dictionary
new_values = {"date" : lambda date: pd.period_range(date.min(),
                                                     periods = 4)}
new_values = [new_values]
df.complete(new_values, by='id')
id date pwr
0 aa 2022Q1 10.0
1 aa 2022Q1 1.0
2 aa 2022Q2 1.0
3 aa 2022Q2 5.0
4 aa 2022Q3 NaN
5 aa 2022Q4 NaN
6 bb 2022Q1 5.0
7 bb 2022Q1 1.0
8 bb 2022Q2 1.0
9 bb 2022Q2 1.0
10 bb 2022Q3 NaN
11 bb 2022Q4 NaN
12 cc 2022Q1 2.0
13 cc 2022Q2 2.0
14 cc 2022Q3 NaN
15 cc 2022Q4 NaN
Sticking to pandas only, first get the unique (id, date) rows from df:
temp = df.drop_duplicates(['id', 'date'])
The unique rows are needed so that we can create a period range per id and reindex on it (reindex does not work with non-unique indices):
temp = (temp
        .set_index('date')
        .groupby('id')
        .apply(lambda df: df.reindex(pd.period_range(df.index.min(),
                                                     periods = 4)))
        .drop(columns=['id', 'pwr'])
       )
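To illustrate why the duplicates have to go before reindexing, here is a small sketch on a made-up series `s` (not the dataframe above): reindexing an index that contains repeated labels raises a ValueError, while the de-duplicated version reindexes cleanly.

```python
import pandas as pd

# Hypothetical series whose index repeats 2022Q1, like the (id, date) pairs above
idx = pd.PeriodIndex(["2022Q1", "2022Q1", "2022Q2"], freq="Q")
s = pd.Series([10, 1, 1], index=idx)

# The four quarters we want every id to cover
target = pd.period_range("2022Q1", periods=4, freq="Q")

# Reindexing with duplicate labels fails
try:
    s.reindex(target)
    raised = False
except ValueError:
    raised = True

# Keeping the first row per label makes the index unique, so reindex works
unique = s[~s.index.duplicated()].reindex(target)
```

The reindexed series only supplies the full set of quarters; the duplicated pwr values come back from df in the merge step.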
Let's join temp back to df:
In [119]: df.merge(temp,
                   left_on=['id', 'date'],
                   right_index = True,
                   how = 'right')
Out[119]:
id date pwr
0 aa 2022Q1 10.0
1 aa 2022Q1 1.0
2 aa 2022Q2 1.0
3 aa 2022Q2 5.0
9 aa 2022Q3 NaN
9 aa 2022Q4 NaN
4 bb 2022Q1 5.0
5 bb 2022Q1 1.0
6 bb 2022Q2 1.0
7 bb 2022Q2 1.0
9 bb 2022Q3 NaN
9 bb 2022Q4 NaN
8 cc 2022Q1 2.0
9 cc 2022Q2 2.0
9 cc 2022Q3 NaN
9 cc 2022Q4 NaN
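The output above covers aa, bb and cc, but not the extra ids such as 'dd' from the question. One possible way to include them, sketched here under the assumption that every id should span the same four quarters starting at the earliest date in df, is to build the full id-by-quarter grid first and left-merge df onto it. The names `extra_ids`, `grid` and `result` are made up for this example.

```python
import pandas as pd

# Rebuild the question's data so the sketch is self-contained
df = pd.DataFrame({
    "id": ["aa", "aa", "aa", "aa", "bb", "bb", "bb", "bb", "cc", "cc"],
    "date": ["Q1.22", "Q1.22", "Q2.22", "Q2.22", "Q1.22",
             "Q1.22", "Q2.22", "Q2.22", "Q1.22", "Q2.22"],
    "pwr": [10, 1, 1, 5, 5, 1, 1, 1, 2, 2],
})

# "Q1.22" -> "2022Q1" -> quarterly period (assumes years in the 2000s)
as_text = df["date"].str.replace(r"Q(\d)\.(\d{2})", r"20\2Q\1", regex=True)
df["date"] = pd.PeriodIndex(as_text, freq="Q")

# Hypothetical list of ids to add; extend as needed
extra_ids = ["dd"]
all_ids = sorted(set(df["id"]).union(extra_ids))

# Assumption: all ids share one range of four quarters
quarters = pd.period_range(df["date"].min(), periods=4)

# Every (id, quarter) combination, as a two-column frame
grid = pd.MultiIndex.from_product(
    [all_ids, quarters], names=["id", "date"]
).to_frame(index=False)

# The left merge keeps every grid row; existing rows, duplicates included, fill in pwr
result = grid.merge(df, on=["id", "date"], how="left")
```

Because the grid is built independently of df, ids that have no rows at all still get their four quarters, with pwr left as NaN.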
Answered By - sammywemmy