Wednesday, January 3, 2024

[FIXED] Summing across a row in Pandas DataFrame and adding a Total column yields 0.0

January 03, 2024 dataframe, pandas, python No comments

Issue

I want to sum all cells (logged working time, in seconds--I can handle the reformatting to HH:MM later when the summing is complete) in a pandas dataframe row to create a "total_hours" column.

The table has an id column, so my code (hopefully) avoids including the first column in the summing.
the rows include int, str, and list values. I used the "numeric_only=True" parameter to avoid including non-ints.
I read somewhere transposing might help. It didn't.
Screenshot of the table attached: https://i.stack.imgur.com/NqTzr.png

My code:

def final_output(df):
    df =  pd.concat([df,pd.DataFrame(list(df['daily_logs']))],axis=1).drop(columns='daily_logs')
    df['total_hours'] = df.iloc[:,1:].sum(axis=1,numeric_only = True) #this is the part that isn't working as intended.
    return df

df_results = final_output(df)
df_results

The data and output are below. The df without "total_hours" is the input. The output looks fine, except the "total_hours" should actually represent a sum of all the integers in the row, and not yield 0.0.

	id	name	2023-11-28	2023-11-29	2023-11-30	2023-12-01	2023-12-02	2023-12-03	2023-12-04	2023-12-05
0	514	alexis	No hours logged	No hours logged	No hours logged	No hours logged	No hours logged	No hours logged	No hours logged	No hours logged
1	2414	donald	29160	29160	28860	28620	[10:02, Odd number: missing entry]	No hours logged	No hours logged	28860
2	3225	mackenzie	28500	28620	28560	28620	28980	No hours logged	No hours logged	28740
3	3434	louisa	25380	18840	26040	No hours logged	25560	No hours logged	No hours logged	24900
4	3864	olga	15600	19080	14400	13800	5220	17820	No hours logged	14400
5	4716	rick	30180	28800	28740	28620	28680	No hours logged	No hours logged	28680
6	5793	roberta	29220	29220	29460	29280	29340	No hours logged	No hours logged	29040

Solution

As per the documentation of df.sum, the parameter numeric_only is used to:

[i]nclude only float, int, boolean columns.

All the columns you are selecting contain at least one string, hence all of them are excluded from the calculation, leading to the sum of an empty Series (default: 0) for each row.

What you are trying to do, can be achieved by first applying pd.to_numeric to each of the columns selected, with the errors parameter set to coerce, causing invalid parsing to result in NaN.

# N.B. start at `df.iloc[:, 2:]`, not `1:`; "id" is the first column

df['total_hours'] = (
    df.iloc[:, 2:]
      .apply(lambda x: pd.to_numeric(x, errors='coerce'))
      .sum(axis=1)
)

df['total_hours']

0         0.0
1    144660.0
2    172020.0
3    120720.0
4    100320.0
5    173700.0
6    175560.0
Name: total_hours, dtype: float64

To adjust the formatting, you could use pd.to_timedelta:

pd.to_timedelta(df['total_hours'], unit='s')

0   0 days 00:00:00
1   1 days 16:11:00
2   1 days 23:47:00
3   1 days 09:32:00
4   1 days 03:52:00
5   2 days 00:15:00
6   2 days 00:46:00
Name: total_hours, dtype: timedelta64[ns]

Data used

import pandas as pd

data = {
    'id': {0: 514, 1: 2414, 2: 3225, 3: 3434, 4: 3864, 5: 4716, 6: 5793},
    'name': {0: 'alexis', 1: 'donald', 2: 'mackenzie', 3: 'louisa', 4: 'olga', 5: 'rick', 6: 'roberta'},
    '2023-11-28': {0: 'No hours logged', 1: '29160', 2: '28500', 3: '25380', 4: '15600', 5: '30180', 6: '29220'},
    '2023-11-29': {0: 'No hours logged', 1: '29160', 2: '28620', 3: '18840', 4: '19080', 5: '28800', 6: '29220'},
    '2023-11-30': {0: 'No hours logged', 1: '28860', 2: '28560', 3: '26040', 4: '14400', 5: '28740', 6: '29460'},
    '2023-12-01': {0: 'No hours logged', 1: '28620', 2: '28620', 3: 'No hours logged', 4: '13800', 5: '28620', 6: '29280'},
    '2023-12-02': {0: 'No hours logged', 1: '[10:02, Odd number: missing entry]', 2: '28980', 3: '25560', 4: '5220', 5: '28680', 6: '29340'},
    '2023-12-03': {0: 'No hours logged', 1: 'No hours logged', 2: 'No hours logged', 3: 'No hours logged', 4: '17820', 5: 'No hours logged', 6: 'No hours logged'},
    '2023-12-04': {0: 'No hours logged', 1: 'No hours logged', 2: 'No hours logged', 3: 'No hours logged', 4: 'No hours logged', 5: 'No hours logged', 6: 'No hours logged'},
    '2023-12-05': {0: 'No hours logged', 1: '28860', 2: '28740', 3: '24900', 4: '14400', 5: '28680', 6: '29040'},
}

df = pd.DataFrame(data)

Answered By - ouroboros1

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, January 3, 2024

[FIXED] Summing across a row in Pandas DataFrame and adding a Total column yields 0.0

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels