Issue
Follow up to: How do I sum a column based on separate category types, and preserve the zeros?
I used 'Id' as a stand in for a bunch of columns. One of them is 'Date'. Here is my dataframe after applying the pivot table method.
Id | Date | A_Hours | B_Hours |
---|---|---|---|
1 | 2/2/2022 | 4 | 4 |
2 | 2/4/2023 | 2 | 6 |
3 | 6/2/2023 | 3 | 0 |
But because 'Date' was one of the columns I added to the index with 'Id', here is the dataframe for a user with multiple hours entered for multiple dates.
Id | Date | A_Hours | B_Hours |
---|---|---|---|
1 | 2/2/2022 | 4 | 4 |
2 | 2/3/2023 | 3 | 3 |
2 | 2/4/2023 | 2 | 6 |
3 | 6/2/2023 | 3 | 0 |
How do I get the following dataframe:
Id | 2/3/2023_A_Hours | 2/4/2023_B_Hours | 6/2/2023_A_Hours | 6/2/2023_B_Hours |
---|---|---|---|---|
1 | 0 | 0 | 0 | 0 |
2 | 3 | 3 | 2 | 6 |
3 | 0 | 0 | 3 | 0 |
Note that there is an additional filter placed on top of this - Id 1 has all 0s because in our result dataframe, not only are we combining rows into multiple columns for a single id, we only want to calculate hours entered in 2023. (and exclude all dates < 1/1/2023 in the count)
Also note that this time, I don't want to create a new column name '2/4/2023_B_Hours'. I would actually prefer a multi-level index with "A_hours" and "B_hours" under "2/4/2023". The table in StackOverflow doesn't allow me to format it like that.
Solution
You should provide a reproducible example for clarity.
Assuming this one, we want to consider dates after 2023-01-15
(which are absent in Id 3):
Id Date Category Hours
0 1 2023-01-01 A 1 # this should be filtered out
1 1 2023-02-01 A 3
2 1 2023-02-01 B 4
3 2 2023-03-01 A 2
4 2 2023-03-01 B 6
5 3 2023-01-01 A 3 # this should be filtered out
You can pre-filter the rows (for example using query
), then reindex
after pivoting to add the missing Ids:
out = (df.query('Date > "2023-01-15"')
.pivot_table(index='Id', columns='Category', values='Hours',
aggfunc='sum', fill_value=0,
margins=True, margins_name='Total')
.add_suffix('_Hours')
.drop('Total')
.reindex(df['Id'].unique(), fill_value=0)
.reset_index().rename_axis(columns=None)
)
Output:
Id A_Hours B_Hours Total_Hours
0 1 3 4 7
1 2 2 6 8
2 3 0 0 0
Answered By - mozway
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.