Monday, August 29, 2022

[FIXED] Pandas how to perform the inner calculation in the dataset?

August 29, 2022 pandas, python No comments

Issue

I have a dataset containing the employee id, age, and experience, I need to calcaute the age and experience based on the logic

Input dataframe

emp_id	name	age	exp	YEAR
101	Rock	20	5	2000
101	Rock	20	5	2001
101	Rock	20	5	2002
102	DED	25	10	2000
102	DED	25	10	2001
102	DED	25	10	2002
103	FEG	30	35	2000
103	FEG	30	35	2001
103	FEG	30	35	2002

Expected output

emp_id	name	age	exp	YEAR	update_age	upadted_exp
101	Rock	20	5	2000	18	3
101	Rock	20	5	2001	19	4
101	Rock	20	5	2002	20	5
102	DED	25	10	2000	23	8
102	DED	25	10	2001	24	9
102	DED	25	10	2002	25	10
103	FEG	30	35	2000	28	33
103	FEG	30	35	2001	29	34
103	FEG	30	35	2002	30	35

Explanation

Logic 1- If the employee has 3 records with age and experience, needs to subtract the last record values from the above 2. Emp_id is unique.

Second Module

I have below data frame based on the max year I need to update the year column. ie - year emp_id 101 have 3 records, max year = 2001 so output column contains 2001,2000,1999

emp_id	name	year
101	Rock	2000
101	Rock	2000
101	Rock	2001
102	DED	2002
102	DED	2002
102	DED	2000
103	FEG	2000
103	FEG	2000
103	FEG	2000

Output

emp_id	name	year	upadte_year
101	Rock	2000	1999
101	Rock	2000	2000
101	Rock	2001	2001
102	DED	2002	2000
102	DED	2002	2001
102	DED	2000	2002
103	FEG	2000	1998
103	FEG	2000	1999
103	FEG	2000	2000

Solution

You can achieve the desired output as follows. Use df.groupby with emp_id and retrieve the cumcount for each group in descending order. Next, subtract the result from cols age and exp to get your updated columns.

import pandas as pd

data = {'emp_id': {0: 101, 1: 101, 2: 101, 3: 102, 4: 102, 5: 102, 
                   6: 103, 7: 103, 8: 103}, 
        'name': {0: 'Rock', 1: 'Rock', 2: 'Rock', 3: 'DED', 4: 'DED', 
                 5: 'DED', 6: 'FEG', 7: 'FEG', 8: 'FEG'}, 
        'age': {0: 20, 1: 20, 2: 20, 3: 25, 4: 25, 5: 25, 6: 30, 
                7: 30, 8: 30}, 
        'exp': {0: 5, 1: 5, 2: 5, 3: 10, 4: 10, 5: 10, 6: 35, 
                7: 35, 8: 35}, 
        'YEAR': {0: 2000, 1: 2001, 2: 2002, 3: 2000, 4: 2001, 
                 5: 2002, 6: 2000, 7: 2001, 8: 2002}
        }

df = pd.DataFrame(data)

cumcount = df.groupby('emp_id').cumcount(ascending=False)

df['updated_age'] = df['age'].sub(cumcount)
df['updated_exp'] = df['exp'].sub(cumcount)

print(df)

   emp_id  name  age  exp  YEAR  updated_age  updated_exp
0     101  Rock   20    5  2000           18            3
1     101  Rock   20    5  2001           19            4
2     101  Rock   20    5  2002           20            5
3     102   DED   25   10  2000           23            8
4     102   DED   25   10  2001           24            9
5     102   DED   25   10  2002           25           10
6     103   FEG   30   35  2000           28           33
7     103   FEG   30   35  2001           29           34
8     103   FEG   30   35  2002           30           35

Update: as for the second question added later: to add the column updated_year to the second df based on the max in column year for each emp_id, try as follows:

df['updated_year'] = df.groupby('emp_id')['year'].transform(max)\
    .sub(df.groupby('emp_id').cumcount(ascending=False))

print(df)

   emp_id  name  year  updated_year
0     101  Rock  2000          1999
1     101  Rock  2000          2000
2     101  Rock  2001          2001
3     102   DED  2002          2000
4     102   DED  2002          2001
5     102   DED  2000          2002
6     103   FEG  2000          1998
7     103   FEG  2000          1999
8     103   FEG  2000          2000

Answered By - ouroboros1

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, August 29, 2022

[FIXED] Pandas how to perform the inner calculation in the dataset?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels