Issue
I want to sum all cells (logged working time, in seconds--I can handle the reformatting to HH:MM later when the summing is complete) in a pandas dataframe row to create a "total_hours" column.
- The table has an id column, so my code (hopefully) avoids including the first column in the summing.
- the rows include int, str, and list values. I used the "numeric_only=True" parameter to avoid including non-ints.
- I read somewhere transposing might help. It didn't.
- Screenshot of the table attached: https://i.stack.imgur.com/NqTzr.png
My code:
def final_output(df):
df = pd.concat([df,pd.DataFrame(list(df['daily_logs']))],axis=1).drop(columns='daily_logs')
df['total_hours'] = df.iloc[:,1:].sum(axis=1,numeric_only = True) #this is the part that isn't working as intended.
return df
df_results = final_output(df)
df_results
The data and output are below. The df without "total_hours" is the input. The output looks fine, except the "total_hours" should actually represent a sum of all the integers in the row, and not yield 0.0.
id | name | 2023-11-28 | 2023-11-29 | 2023-11-30 | 2023-12-01 | 2023-12-02 | 2023-12-03 | 2023-12-04 | 2023-12-05 | total_hours | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 514 | alexis | No hours logged | No hours logged | No hours logged | No hours logged | No hours logged | No hours logged | No hours logged | No hours logged | 0.0 |
1 | 2414 | donald | 29160 | 29160 | 28860 | 28620 | [10:02, Odd number: missing entry] | No hours logged | No hours logged | 28860 | 0.0 |
2 | 3225 | mackenzie | 28500 | 28620 | 28560 | 28620 | 28980 | No hours logged | No hours logged | 28740 | 0.0 |
3 | 3434 | louisa | 25380 | 18840 | 26040 | No hours logged | 25560 | No hours logged | No hours logged | 24900 | 0.0 |
4 | 3864 | olga | 15600 | 19080 | 14400 | 13800 | 5220 | 17820 | No hours logged | 14400 | 0.0 |
5 | 4716 | rick | 30180 | 28800 | 28740 | 28620 | 28680 | No hours logged | No hours logged | 28680 | 0.0 |
6 | 5793 | roberta | 29220 | 29220 | 29460 | 29280 | 29340 | No hours logged | No hours logged | 29040 | 0.0 |
Solution
As per the documentation of df.sum
, the parameter numeric_only
is used to:
[i]nclude only float, int, boolean columns.
All the columns you are selecting contain at least one string, hence all of them are excluded from the calculation, leading to the sum of an empty Series (default: 0
) for each row.
What you are trying to do, can be achieved by first applying pd.to_numeric
to each of the columns selected, with the errors
parameter set to coerce
, causing invalid parsing to result in NaN
.
# N.B. start at `df.iloc[:, 2:]`, not `1:`; "id" is the first column
df['total_hours'] = (
df.iloc[:, 2:]
.apply(lambda x: pd.to_numeric(x, errors='coerce'))
.sum(axis=1)
)
df['total_hours']
0 0.0
1 144660.0
2 172020.0
3 120720.0
4 100320.0
5 173700.0
6 175560.0
Name: total_hours, dtype: float64
To adjust the formatting, you could use pd.to_timedelta
:
pd.to_timedelta(df['total_hours'], unit='s')
0 0 days 00:00:00
1 1 days 16:11:00
2 1 days 23:47:00
3 1 days 09:32:00
4 1 days 03:52:00
5 2 days 00:15:00
6 2 days 00:46:00
Name: total_hours, dtype: timedelta64[ns]
Data used
import pandas as pd
data = {
'id': {0: 514, 1: 2414, 2: 3225, 3: 3434, 4: 3864, 5: 4716, 6: 5793},
'name': {0: 'alexis', 1: 'donald', 2: 'mackenzie', 3: 'louisa', 4: 'olga', 5: 'rick', 6: 'roberta'},
'2023-11-28': {0: 'No hours logged', 1: '29160', 2: '28500', 3: '25380', 4: '15600', 5: '30180', 6: '29220'},
'2023-11-29': {0: 'No hours logged', 1: '29160', 2: '28620', 3: '18840', 4: '19080', 5: '28800', 6: '29220'},
'2023-11-30': {0: 'No hours logged', 1: '28860', 2: '28560', 3: '26040', 4: '14400', 5: '28740', 6: '29460'},
'2023-12-01': {0: 'No hours logged', 1: '28620', 2: '28620', 3: 'No hours logged', 4: '13800', 5: '28620', 6: '29280'},
'2023-12-02': {0: 'No hours logged', 1: '[10:02, Odd number: missing entry]', 2: '28980', 3: '25560', 4: '5220', 5: '28680', 6: '29340'},
'2023-12-03': {0: 'No hours logged', 1: 'No hours logged', 2: 'No hours logged', 3: 'No hours logged', 4: '17820', 5: 'No hours logged', 6: 'No hours logged'},
'2023-12-04': {0: 'No hours logged', 1: 'No hours logged', 2: 'No hours logged', 3: 'No hours logged', 4: 'No hours logged', 5: 'No hours logged', 6: 'No hours logged'},
'2023-12-05': {0: 'No hours logged', 1: '28860', 2: '28740', 3: '24900', 4: '14400', 5: '28680', 6: '29040'},
}
df = pd.DataFrame(data)
Answered By - ouroboros1
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.