Issue
My CSV sample data is
ID,Project,From,To,Percentage
1,APPLE,01-01-2022,31-03-2022,50
1,MICROSOFT,01-01-2022,15-01-2022,50
1,MICROSOFT,01-02-2022,28-02-2022,50
1,MICROSOFT,01-03-2022,31-03-2022,50
2,ORACLE,01-02-2022,23-06-2022,50
3,APPLE,23-04-2022,23-06-2022,100
1,MICROSOFT,16-01-2022,31-01-2022,50
2,DELL,01-12-2021,01-04-2022,50
My objective is to see if on any date any employee has been allocated for more than 100% in a project. As you can see employee id 1 has been allocated 1/Jan/22 to 31/Mar/22 for 50% in APPLE and multiple sub-periods in MICROSOFT for 50% itself but it isn't for any period greater than 100% allocation.
How when I sum(percentage) using this criteria (StartA <= EndB) and (EndA >= StartB)
I get 250% for the first line item because it will match with all other line times for ID=1. The period is engulfing all the smaller periods.
Is there a way to check if any employee has been allocated for more than 100% on any given date?
Solution
One of your problems is your expect result given your data would be rather lame. So I added one more line:
ID Project From To Percentage
0 1 APPLE 2022-01-01 2022-03-31 50
1 1 MICROSOFT 2022-01-01 2022-01-15 50
2 1 MICROSOFT 2022-02-01 2022-02-28 50
3 1 MICROSOFT 2022-03-01 2022-03-31 50
4 2 ORACLE 2022-02-01 2022-06-23 50
5 3 APPLE 2022-04-23 2022-06-23 100
6 1 MICROSOFT 2022-01-16 2022-01-31 50
7 2 DELL 2021-12-01 2022-04-01 50
0 1 Home 2022-01-01 2022-01-07 75
Doing:
# Format dates as proper datetimes.
df['From'] = pd.to_datetime(df['From'], format='%d-%m-%Y')
df['To'] = pd.to_datetime(df['To'], format='%d-%m-%Y')
# Create the date Range:
df['dates'] = df.apply(lambda row: pd.date_range(row.From, row.To, freq='D'), axis=1)
# Explode the dates:
df = df.explode('dates')
# Find the dates where someone is over allocated:
out = df.groupby(['ID', 'dates'])['Percentage'].sum()[lambda x: x.gt(100)]
print(out.reset_index())
Output:
ID dates Percentage
0 1 2022-01-01 175
1 1 2022-01-02 175
2 1 2022-01-03 175
3 1 2022-01-04 175
4 1 2022-01-05 175
5 1 2022-01-06 175
6 1 2022-01-07 175
To put it back into your original data format:
df['allocation'] = df.groupby(['ID', 'dates'])['Percentage'].transform('sum')
df = df.groupby(['ID', 'Project', 'From', 'To', 'Percentage'], as_index=False)['allocation'].max()
Output:
ID Project From To Percentage allocation
0 1 APPLE 2022-01-01 2022-03-31 50 175
1 1 Home 2022-01-01 2022-01-07 75 175
2 1 MICROSOFT 2022-01-01 2022-01-15 50 175
3 1 MICROSOFT 2022-01-16 2022-01-31 50 100
4 1 MICROSOFT 2022-02-01 2022-02-28 50 100
5 1 MICROSOFT 2022-03-01 2022-03-31 50 100
6 2 DELL 2021-12-01 2022-04-01 50 100
7 2 ORACLE 2022-02-01 2022-06-23 50 100
8 3 APPLE 2022-04-23 2022-06-23 100 100
Since there appears to be some confusion about how this works, here's a poor visualization of the problem:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, figsize=(16,6))
for group in df.groupby('ID'):
temp_df = group[1]
ax.barh(y=temp_df.Project, width=temp_df.days_start_to_end, left=temp_df.start_num, label=temp_df.ID.unique())
ax.legend()
plt.show()
Output:
This doesn't properly show per-project percentages, but it does show that at certain time periods ID 1 has been assigned to three different projects. In this case we've calculated that from 1/1 to 1/7 they're overbooked at 175%. The original formatting of the table is unable to show this, so the best we can do is display the max utilization during the given row's time period.
Answered By - BeRT2me
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.