Sunday, June 26, 2022

[FIXED] How to SUM based on criteria in Pandas

June 26, 2022 pandas, python No comments

Issue

My CSV sample data is

ID,Project,From,To,Percentage
1,APPLE,01-01-2022,31-03-2022,50
1,MICROSOFT,01-01-2022,15-01-2022,50
1,MICROSOFT,01-02-2022,28-02-2022,50
1,MICROSOFT,01-03-2022,31-03-2022,50
2,ORACLE,01-02-2022,23-06-2022,50
3,APPLE,23-04-2022,23-06-2022,100
1,MICROSOFT,16-01-2022,31-01-2022,50
2,DELL,01-12-2021,01-04-2022,50

My objective is to see if on any date any employee has been allocated for more than 100% in a project. As you can see employee id 1 has been allocated 1/Jan/22 to 31/Mar/22 for 50% in APPLE and multiple sub-periods in MICROSOFT for 50% itself but it isn't for any period greater than 100% allocation.

How when I sum(percentage) using this criteria (StartA <= EndB) and (EndA >= StartB)

I get 250% for the first line item because it will match with all other line times for ID=1. The period is engulfing all the smaller periods.

Is there a way to check if any employee has been allocated for more than 100% on any given date?

Edit: EXPECTED RESULT

Solution

One of your problems is your expect result given your data would be rather lame. So I added one more line:

  ID    Project       From         To Percentage
0  1      APPLE 2022-01-01 2022-03-31         50
1  1  MICROSOFT 2022-01-01 2022-01-15         50
2  1  MICROSOFT 2022-02-01 2022-02-28         50
3  1  MICROSOFT 2022-03-01 2022-03-31         50
4  2     ORACLE 2022-02-01 2022-06-23         50
5  3      APPLE 2022-04-23 2022-06-23        100
6  1  MICROSOFT 2022-01-16 2022-01-31         50
7  2       DELL 2021-12-01 2022-04-01         50
0  1       Home 2022-01-01 2022-01-07         75

Doing:

# Format dates as proper datetimes.
df['From'] = pd.to_datetime(df['From'], format='%d-%m-%Y')
df['To'] = pd.to_datetime(df['To'], format='%d-%m-%Y')

# Create the date Range:
df['dates'] = df.apply(lambda row: pd.date_range(row.From, row.To, freq='D'), axis=1)

# Explode the dates:
df = df.explode('dates')

# Find the dates where someone is over allocated:
out = df.groupby(['ID', 'dates'])['Percentage'].sum()[lambda x: x.gt(100)]
print(out.reset_index())

Output:

   ID      dates Percentage
0   1 2022-01-01        175
1   1 2022-01-02        175
2   1 2022-01-03        175
3   1 2022-01-04        175
4   1 2022-01-05        175
5   1 2022-01-06        175
6   1 2022-01-07        175

To put it back into your original data format:

df['allocation'] = df.groupby(['ID', 'dates'])['Percentage'].transform('sum')
df = df.groupby(['ID', 'Project', 'From', 'To', 'Percentage'], as_index=False)['allocation'].max()

Output:

   ID    Project       From         To  Percentage  allocation
0   1      APPLE 2022-01-01 2022-03-31          50         175
1   1       Home 2022-01-01 2022-01-07          75         175
2   1  MICROSOFT 2022-01-01 2022-01-15          50         175
3   1  MICROSOFT 2022-01-16 2022-01-31          50         100
4   1  MICROSOFT 2022-02-01 2022-02-28          50         100
5   1  MICROSOFT 2022-03-01 2022-03-31          50         100
6   2       DELL 2021-12-01 2022-04-01          50         100
7   2     ORACLE 2022-02-01 2022-06-23          50         100
8   3      APPLE 2022-04-23 2022-06-23         100         100

Since there appears to be some confusion about how this works, here's a poor visualization of the problem:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, figsize=(16,6))

for group in df.groupby('ID'):
    temp_df = group[1]
    ax.barh(y=temp_df.Project, width=temp_df.days_start_to_end, left=temp_df.start_num, label=temp_df.ID.unique())

ax.legend()
plt.show()

Output:

This doesn't properly show per-project percentages, but it does show that at certain time periods ID 1 has been assigned to three different projects. In this case we've calculated that from 1/1 to 1/7 they're overbooked at 175%. The original formatting of the table is unable to show this, so the best we can do is display the max utilization during the given row's time period.

Answered By - BeRT2me

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Sunday, June 26, 2022

[FIXED] How to SUM based on criteria in Pandas

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels