Issue
I have some data like below, about a 100k rows with different batches.
batch b c
A 10 2
A 20 4
A 30 6
B 5 1
B 10 2
B 15 3
B 20 4
. . .
. . .
. . .
What I am trying to do is to is, using a formula, I want only 1 line of data frame per group with the help of a summation formula.
The formula is same throughout all batches as below
(b1xc1)+(b2xc2)+(b3xc3) 10x2 + 20x4 + 30x6 20 + 80 + 180 280
batch A = _______________________ = __________________ = _______________ = _____ = 23.33
c1 + c2 + c3 2 + 4 + 6 12 12
So the new_df will be:
batch new_value
A 23.3
B 15.0
. .
. .
. .
Please Note: there are zero values in col C, so might be possible divide by zero may occur.
With my approach I created a new column bxc and then summing up values by group and dividing by sum of c for that group, but I think it is not much effective way. Is there an alternate way to achieve this? Thank you :)
Solution
To make it easier to follow I am recreating sample data
as follows:
import pandas as pd
data = {'batch': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
'b': [10, 20, 30, 5, 10, 15, 20],
'c': [2, 4, 6, 1, 2, 3, 4]}
df = pd.DataFrame(data)
Then I would recommend using a custom function to calculate the value and to avoid division by zero:
def calculate_new_value(group):
numerator = (group['b'] * group['c']).sum()
denominator = group['c'].sum()
if denominator == 0:
return 0
return round(numerator / denominator, 1)
Applying this function with apply
after groupby
and setting your index with reset_index
to new_value
results in:
print(df.groupby('batch').apply(calculate_new_value).reset_index(name='new_value'))
batch new_value
0 A 23.3
1 B 15.0
Answered By - sm3sher
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.