Issue
I have a leave dataset of partners with leave start date and end date, duration of leaves and Last Working Date (LWD). I need to find the sum of leaves for each partner availed four weeks from LWD grouped in each week interval from LWD. Week1 may be considered 7 days from LWD, week2 as the next 7 days and so on.
EDIT: The aim is to find out the number of leaves each partner availed in each of the last four weeks till their departure from the company
Dataset example below, dates are in dd/mm/yyyy format
I'm looking for an outcome such as:
I understand there would be a groupby followed by datetime.timedelta(days = 7) to get to the dates from LWD, but I'm confused as to how to arrive at the final outcome. Any help appreciated. Please note that the weekly sums are not cumulative; each sum covers only the span of that specific week.
import pandas as pd
df = pd.DataFrame({'EID':[75161,75162,75162,75162,75162,75166,75166,75166,75169,75170],
'START_DATE':['30/08/21','01/10/21','18/06/21','12/11/21','14/06/21','22/04/21','22/07/21','23/08/21','24/08/21','25/10/21'],
'END_DATE':['30/08/21','01/10/21','18/06/21','12/11/21','14/06/21','23/04/21','23/07/21','23/08/21','26/08/21','25/10/21'],
'LWD':['30/08/21','13/11/21','13/11/21','13/11/21','13/11/21','13/10/21','13/10/21','13/10/21','13/10/21','13/11/21'],
'DURATION':[1,1,1,1,1,2,2,1,3,1]
})
df['START_DATE'] = pd.to_datetime(df['START_DATE'], infer_datetime_format=True)
df['END_DATE'] = pd.to_datetime(df['END_DATE'], infer_datetime_format=True)
df['LWD'] = pd.to_datetime(df['LWD'], infer_datetime_format=True)
Solution
The first thing to note about your example is that you need to pass the dayfirst=True argument to pd.to_datetime when converting the date columns, as shown below:
df['START_DATE'] = pd.to_datetime(df['START_DATE'], infer_datetime_format=True, dayfirst=True)
df['END_DATE'] = pd.to_datetime(df['END_DATE'], infer_datetime_format=True, dayfirst=True)
df['LWD'] = pd.to_datetime(df['LWD'], infer_datetime_format=True, dayfirst=True)
Once you have made that change, your date fields should hold consistent and correct dates, as illustrated below:
df = pd.DataFrame({'EID':[75161,75162,75162,75162,75162,75166,75166,75166,75169,75170],
'START_DATE':['30/08/21','01/10/21','18/10/21','12/11/21','14/06/21','22/04/21','22/07/21','23/08/21','24/08/21','25/10/21'],
'END_DATE':['30/08/21','01/10/21','18/10/21','12/11/21','14/06/21','23/04/21','23/07/21','23/08/21','26/08/21','25/10/21'],
'LWD':['30/08/21','13/11/21','13/11/21','13/11/21','13/11/21','13/10/21','13/10/21','13/10/21','13/10/21','13/11/21'],
'DURATION':[1,1,1,1,1,2,2,1,3,1]
})
df['START_DATE'] = pd.to_datetime(df['START_DATE'], infer_datetime_format=True, dayfirst=True)
df['END_DATE'] = pd.to_datetime(df['END_DATE'], infer_datetime_format=True, dayfirst=True)
df['LWD'] = pd.to_datetime(df['LWD'], infer_datetime_format=True, dayfirst=True)
Note: I altered some of your data to add some complexity to the example, so that a single ID has leave dates in more than one period of interest.
My dataframe looks like:
EID START_DATE END_DATE LWD DURATION
0 75161 2021-08-30 2021-08-30 2021-08-30 1
1 75162 2021-10-01 2021-10-01 2021-11-13 1
2 75162 2021-10-18 2021-10-18 2021-11-13 1
3 75162 2021-11-12 2021-11-12 2021-11-13 1
4 75162 2021-06-14 2021-06-14 2021-11-13 1
5 75166 2021-04-22 2021-04-23 2021-10-13 2
6 75166 2021-07-22 2021-07-23 2021-10-13 2
7 75166 2021-08-23 2021-08-23 2021-10-13 1
8 75169 2021-08-24 2021-08-26 2021-10-13 3
9 75170 2021-10-25 2021-10-25 2021-11-13 1
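To see why dayfirst=True matters, here is a minimal sketch using one of the ambiguous strings from the data. The variable names are illustrative only, and the explicit format strings are there just to make the two possible readings visible:

```python
import pandas as pd

ambiguous = '01/10/21'  # 1 October 2021 in the dd/mm/yy data above

# Explicit formats make the two possible readings visible.
month_first = pd.to_datetime(ambiguous, format='%m/%d/%y')
day_first = pd.to_datetime(ambiguous, format='%d/%m/%y')

print(month_first.date())  # 2021-01-10
print(day_first.date())    # 2021-10-01

# dayfirst=True selects the day-first reading without spelling out a format.
print(pd.to_datetime(ambiguous, dayfirst=True).date())
```

Strings such as '30/08/21' can only be read one way, which is why a column parsed without dayfirst=True can end up with a mix of correctly and incorrectly interpreted dates.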
Now the first step is to add a column which shows the weeks before LWD in which leave has been taken as follows:
# define function to calculate the difference in calendar weeks between two datetime columns
def week_diff(x: pd.Series, y: pd.Series) -> pd.Series:
    # convert each date to its weekly period and take the period's integer ordinal
    end = x.dt.to_period('W').view(dtype='int64')
    start = y.dt.to_period('W').view(dtype='int64')
    return end - start
df['wks_delta'] = week_diff(df['LWD'], df['START_DATE'])
Results in:
EID START_DATE END_DATE LWD DURATION wks_delta
0 75161 2021-08-30 2021-08-30 2021-08-30 1 0
1 75162 2021-10-01 2021-10-01 2021-11-13 1 6
2 75162 2021-10-18 2021-10-18 2021-11-13 1 3
3 75162 2021-11-12 2021-11-12 2021-11-13 1 0
4 75162 2021-06-14 2021-06-14 2021-11-13 1 21
5 75166 2021-04-22 2021-04-23 2021-10-13 2 25
6 75166 2021-07-22 2021-07-23 2021-10-13 2 12
7 75166 2021-08-23 2021-08-23 2021-10-13 1 7
8 75169 2021-08-24 2021-08-26 2021-10-13 3 7
9 75170 2021-10-25 2021-10-25 2021-11-13 1 2
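Note that wks_delta above counts calendar weeks (Monday to Sunday), whereas the question describes week 1 as the 7 days counted back from LWD. The two can differ near week boundaries. The sketch below is one possible way to compute both; rolling_week_diff and calendar_week_diff are hypothetical helper names, and the calendar version uses period start times rather than Series.view, which newer pandas releases deprecate:

```python
import pandas as pd

def rolling_week_diff(lwd: pd.Series, start: pd.Series) -> pd.Series:
    """Whole 7-day blocks between start and lwd (0 = within 6 days before lwd)."""
    return (lwd - start).dt.days // 7

def calendar_week_diff(lwd: pd.Series, start: pd.Series) -> pd.Series:
    """Difference in Monday-to-Sunday calendar weeks between start and lwd."""
    lwd_week = lwd.dt.to_period('W').dt.start_time
    start_week = start.dt.to_period('W').dt.start_time
    return (lwd_week - start_week).dt.days // 7

# Made-up sample: a Sunday leave before a Monday LWD, plus one row from the data.
sample = pd.DataFrame({
    'START_DATE': pd.to_datetime(['2021-11-14', '2021-10-18']),
    'LWD': pd.to_datetime(['2021-11-15', '2021-11-13']),
})
sample['rolling'] = rolling_week_diff(sample['LWD'], sample['START_DATE'])
sample['calendar'] = calendar_week_diff(sample['LWD'], sample['START_DATE'])

print(sample[['rolling', 'calendar']].values.tolist())  # [[0, 1], [3, 3]]
```

For the example data in this answer both definitions give the same wks_delta values, so the rest of the steps are unaffected by which one you pick.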
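We can then filter this dataframe down to the four weeks of interest (wks_delta values 0 to 3, one per output column) and groupby("EID", 'wks_delta') using the following:
# keep only leave that falls in the four weeks up to LWD
df = df[df['wks_delta'].between(0, 3)]
# sum only DURATION; the date columns cannot be summed
df1 = df.groupby(['EID', 'wks_delta'])[['DURATION']].sum()
df1.reset_index(inplace=True)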
resulting in:
EID wks_delta DURATION
0 75161 0 1
1 75162 0 1
2 75162 3 1
3 75170 2 1
Then, by applying the following:
def computeLeavePeriods(prds: list, df: pd.DataFrame) -> pd.DataFrame:
    row_index = list(df["EID"].unique())
    rows = len(row_index)
    cols = len(prds)
    # one row per EID, one column per week label, all starting at zero
    rslt = [[0]*cols for i in range(rows)]
    for r in range(df.shape[0]):
        rslt[row_index.index(df.iloc[r]['EID'])][df.iloc[r]['wks_delta']] += df.iloc[r]['DURATION']
    return pd.DataFrame(data= rslt, columns=prds, index=row_index)
computeLeavePeriods(['1-LWD', '2-LWD', '3-LWD', '4-LWD'], df1)
we get the final result:
       1-LWD  2-LWD  3-LWD  4-LWD
75161      1      0      0      0
75162      1      0      0      1
75170      0      0      1      0
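If you would rather avoid the explicit loop, the same table can probably be produced with groupby and unstack. This is only a sketch of an alternative, not part of the approach above; it rebuilds the grouped frame by hand so that it runs on its own, and the names labels and wide are illustrative:

```python
import pandas as pd

# Grouped frame as shown earlier (EID, wks_delta, DURATION).
df1 = pd.DataFrame({
    'EID': [75161, 75162, 75162, 75170],
    'wks_delta': [0, 0, 3, 2],
    'DURATION': [1, 1, 1, 1],
})
labels = ['1-LWD', '2-LWD', '3-LWD', '4-LWD']

wide = (
    df1.groupby(['EID', 'wks_delta'])['DURATION'].sum()
       .unstack(fill_value=0)                              # weeks become columns
       .reindex(columns=range(len(labels)), fill_value=0)  # add weeks with no leave
       .astype(int)
)
wide.columns = labels

print(wide.loc[75162].tolist())  # [1, 0, 0, 1]
```

The reindex step is what guarantees all four week columns appear even when no partner took leave in a given week.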
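To handle DURATION values that are floats, you can modify the computeLeavePeriods function as shown below. When a float column is present, df.iloc[r] returns the whole row upcast to float, so wks_delta has to be cast back to int before it can be used as a list index: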
def computeLeavePeriods(prds: list, df: pd.DataFrame) -> pd.DataFrame:
    row_index = list(df["EID"].unique())
    rows = len(row_index)
    cols = len(prds)
    rslt = [[0]*cols for i in range(rows)]
    for r in range(df.shape[0]):
        # int(...) restores an integer index after the row was upcast to float
        rslt[row_index.index(df.iloc[r]['EID'])][int(df.iloc[r]['wks_delta'])] += df.iloc[r]['DURATION']
    return pd.DataFrame(data= rslt, columns=prds, index=row_index)
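A small sketch of why the int(...) cast is needed, using a made-up single-row frame with a half-day leave (the names row, week and buckets are illustrative only):

```python
import pandas as pd

# Hypothetical grouped frame with a fractional DURATION (half-day leave).
df1 = pd.DataFrame({'EID': [75161], 'wks_delta': [0], 'DURATION': [0.5]})

row = df1.iloc[0]            # mixed int/float columns come back as one float64 row
week = row['wks_delta']
print(type(week).__name__)   # float64

buckets = [0, 0, 0, 0]
try:
    buckets[week] += row['DURATION']   # a float is rejected as a list index
except TypeError as exc:
    print('TypeError:', exc)

buckets[int(week)] += float(row['DURATION'])  # casting back to int works
print(buckets)               # [0.5, 0, 0, 0]
```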
Answered By - itprorh66