Issue
I have below contents in csv file:
key1 key2 Key3 key4 key5
Val1 A 51 'True' 25
Val1 A 50 'False' 25
Val1 A 49 'True' 25
Val1 A 48 'True' 25
Val2 A 47 'False' 25
Val2 A 46 'True' 25
Val2 A 45 'False' 25
Val2 A 44 'True' 25
Val2 A 43 'True' 25
Output should be like:
key1 key2 max_key5 total_key4 total_true_key4 grade
Val1 A 51 4 3 1
Val2 A 47 5 3 2
I have to group by key1 and key2 then find the max of key5 and count of total rows of key4 and count of total true rows of key4 and then percentage.
What I am trying:
grd = "1 if avg > 70 else 2 if avg > 50 else c"
pct = lambda x: (1 if x > 70 else (2 if x > 50 else 3))
json_data
.assign(_key4=lambda df_: df_['key4'] == "'True'")
.groupby(['key1', 'key2'])
.agg(
maxkey5=('key5', 'max'),
total_key4=('key4', 'count'),
total_true_key4=('_key4', 'sum')
)
.eval('avg = (total_true_key4 * 100) / total_key4')
.eval('feg = grd')
#.apply(pct(avg))
eval to calculate percentage is working fine.. not able to do if else if on avg column
I don't want to apply separately like in another separate statement.
Solution
As mentioned in the comments, I am not quite sure how you would get your desired output with your logic. Also, you define maxkey5=('key5', 'max')
but actually you are getting the max
of key3
. So, if I am not mistaken, you are looking for something like this:
import pandas as pd
from numpy import mean
df = pd.DataFrame({
"key1": ["Val1"]*4+["Val2"]*5,
"key2": ["A"]*9,
"key3": [51, 50, 49, 48, 47, 46, 45, 44, 43],
"key4": ["'True'", "'False'", "'True'", "'True'", "'False'", "'True'", "'False'", "'True'", "'True'"],
"key5": [25] * 9
})
(
df
.assign(total_true_key4=df['key4']=="'True'", grade=df['key4']=="'True'")
.groupby(["key1", "key2"])[["key3", "key4", "total_true_key4", "grade"]]
.agg(
{"key3": "max",
"key4": "count",
"total_true_key4": "sum",
"grade": lambda x: 1 if mean(x)>.7 else (2 if mean(x)>.5 else 3)}
)
.rename(columns={"key3":"max_key5", "key4":"total_key4"}).reset_index()
)
which results into the following data frame:
key1 key2 max_key5 total_key4 total_true_key4 grade
0 Val1 A 51 4 3 1
1 Val2 A 47 5 3 2
EDIT
IIUC, you want to use your logic and just assign your grade column directly after defining your avg
column. The way to achieve this is given below:
bins = [0, 50, 70, 100]
labels = [3, 2, 1]
(
df
.assign(_key4=lambda df_: df_['key4'] == "'True'")
.groupby(['key1', 'key2'])
.agg(
maxkey5=('key5', 'max'),
total_key4=('key4', 'count'),
total_true_key4=('_key4', 'sum')
)
.eval('avg = (total_true_key4 * 100) / total_key4')
.assign(grade=lambda x: pd.cut(x.avg, bins, labels=labels))
)
Output:
maxkey5 total_key4 total_true_key4 avg grade
key1 key2
Val1 A 25 4 3 75.0 1
Val2 A 25 5 3 60.0 2
I did not change anything in your logic, besides of the last assign
statement, where I assigned a new variable named grade
using your evaluated avg
column and the pd.cut
method for pre-defined labels and bins.
Answered By - ko3
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.