Friday, July 15, 2022

[FIXED] resuse agg columns panda group by

July 15, 2022 dataframe, pandas, python No comments

Issue

I have below contents in csv file:

key1    key2    Key3    key4    key5

Val1    A        51     'True'  25
Val1    A        50     'False' 25
Val1    A        49     'True'  25
Val1    A        48     'True'  25
Val2    A        47     'False' 25
Val2    A        46     'True'  25
Val2    A        45     'False' 25
Val2    A        44     'True'  25
Val2    A        43     'True'  25

Output should be like:

key1 key2 max_key5 total_key4 total_true_key4 grade
Val1  A   51       4          3                1
Val2  A   47       5          3                2

I have to group by key1 and key2 then find the max of key5 and count of total rows of key4 and count of total true rows of key4 and then percentage.

What I am trying:

  grd = "1 if avg > 70 else 2 if avg > 50 else c"

    pct = lambda x: (1 if x > 70 else (2 if x > 50 else 3))



json_data
    .assign(_key4=lambda df_: df_['key4'] == "'True'")
    .groupby(['key1', 'key2'])
    .agg(
        maxkey5=('key5', 'max'), 
        total_key4=('key4', 'count'), 
        total_true_key4=('_key4', 'sum')
    )
   .eval('avg = (total_true_key4 * 100) / total_key4')
   .eval('feg = grd')
  #.apply(pct(avg))

eval to calculate percentage is working fine.. not able to do if else if on avg column

I don't want to apply separately like in another separate statement.

Solution

As mentioned in the comments, I am not quite sure how you would get your desired output with your logic. Also, you define maxkey5=('key5', 'max') but actually you are getting the max of key3. So, if I am not mistaken, you are looking for something like this:

import pandas as pd
from numpy import mean

df = pd.DataFrame({
    "key1": ["Val1"]*4+["Val2"]*5,
    "key2": ["A"]*9,
    "key3": [51, 50, 49, 48, 47, 46, 45, 44, 43],
    "key4": ["'True'", "'False'", "'True'", "'True'", "'False'", "'True'", "'False'", "'True'", "'True'"],
    "key5": [25] * 9
})


(
    df
    .assign(total_true_key4=df['key4']=="'True'", grade=df['key4']=="'True'")
    .groupby(["key1", "key2"])[["key3", "key4", "total_true_key4", "grade"]]
    .agg(
        {"key3": "max", 
         "key4": "count",
         "total_true_key4": "sum",
         "grade": lambda x: 1 if mean(x)>.7 else (2 if mean(x)>.5 else 3)}
    )
    .rename(columns={"key3":"max_key5", "key4":"total_key4"}).reset_index()
)

which results into the following data frame:

   key1   key2  max_key5  total_key4  total_true_key4  grade
0  Val1   A     51        4           3                1
1  Val2   A     47        5           3                2

EDIT

IIUC, you want to use your logic and just assign your grade column directly after defining your avg column. The way to achieve this is given below:

bins = [0, 50, 70, 100]
labels = [3, 2, 1]

(
    df
    .assign(_key4=lambda df_: df_['key4'] == "'True'")
    .groupby(['key1', 'key2'])
    .agg(
        maxkey5=('key5', 'max'), 
        total_key4=('key4', 'count'), 
        total_true_key4=('_key4', 'sum')
    )
   .eval('avg = (total_true_key4 * 100) / total_key4')
   .assign(grade=lambda x: pd.cut(x.avg, bins, labels=labels))
)

Output:

               maxkey5  total_key4  total_true_key4  avg    grade
key1    key2                    
Val1    A      25        4           3               75.0   1
Val2    A      25        5           3               60.0   2

I did not change anything in your logic, besides of the last assign statement, where I assigned a new variable named grade using your evaluated avg column and the pd.cut method for pre-defined labels and bins.

Answered By - ko3

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Friday, July 15, 2022

[FIXED] resuse agg columns panda group by

Issue

Solution

EDIT

0 comments:

Post a Comment

Popular Posts

Labels