Issue
I have the following dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Name' : ['Jake', 'Nate', '', 'Alex', '', 'Max', 'Nate', 'Jake'],
'Color' : ['', 'red;blue', 'blue;pink', 'green;blue;red', '', '', 'blue', 'red;yellow'],
'Value_1' : [1211233.419, 4007489.726, 953474.6894, np.NaN, 1761987.704, 222600361, 404419.2243, 606066.067 ],
'Value_2' : [np.NaN, 1509907.457, 4792269.911, 43486.59312, np.NaN, np.NaN, 2066645.251, 60988660.37],
'Value_3' : [1175299.998, np.NaN, 1888559.459, np.NaN, 444689.0177, 405513.0572, 343704.0269, 2948494.383]})
---
Name Color Value_1 Value_2 Value_3
0 Jake 1.211233e+06 NaN 1.175300e+06
1 Nate red;blue 4.007490e+06 1.509907e+06 NaN
2 blue;pink 9.534747e+05 4.792270e+06 1.888559e+06
3 Alex green;blue;red NaN 4.348659e+04 NaN
4 1.761988e+06 NaN 4.446890e+05
5 Max 2.226004e+08 NaN 4.055131e+05
6 Nate blue 4.044192e+05 2.066645e+06 3.437040e+05
7 Jake red;yellow 6.060661e+05 6.098866e+07 2.948494e+06
I need two things:
1)In the first case I need to add all the values (Value_1, Value_2, Value_3) where I have the same name and get for example:
Name Value_1 Value_2 Value_3
0 Jake 1.817299e+06 6.098866e+07 4.123794e+06
1 Nate 4.411909e+06 3.576553e+06 3.437040e+05
2 Alex NaN 4.348659e+04 NaN
3 Max 2.226004e+08 NaN 4.055131e+05
2)I need the same thing but with the values of the name column plus the splits of the color column (only if there is at least one name and one color in the same row):
Name Color Value_1 Value_2 Value_3
0 Alex green NaN 4.348659e+04 NaN
1 Alex blue NaN 4.348659e+04 NaN
3 Alex red NaN 4.348659e+04 NaN
4 Jake red 6.060661e+05 6.098866e+07 2.948494e+06
5 Jake yellow 6.060661e+05 6.098866e+07 2.948494e+06
6 Nate red 4.007490e+06 1.509907e+06 NaN
7 Nate blue 4.411909e+06 3.576553e+06 3.437040e+05
(Note that in this case the only line present twice is Nate-Blue)
Solution
First replace empty strings in first 2 columns to mising values:
df1[['Name','Color']] = df1[['Name','Color']].replace('', np.nan)
Then aggregate sum
with min_count=1
for missing values instead 0
:
df2 = df1.groupby('Name', as_index=False).sum(min_count=1)
print (df2)
Name Value_1 Value_2 Value_3
0 Alex NaN 4.348659e+04 NaN
1 Jake 1.817299e+06 6.098866e+07 4.123794e+06
2 Max 2.226004e+08 NaN 4.055131e+05
3 Nate 4.411909e+06 3.576553e+06 3.437040e+05
For second ouput first use Series.str.split
with DataFrame.explode
and then aggregate sum
:
df3 = (df1.assign(Color=df1['Color'].str.split(';'))
.explode('Color')
.groupby(['Name', 'Color'], as_index=False)
.sum(min_count=1))
print (df3)
Name Color Value_1 Value_2 Value_3
0 Alex blue NaN 4.348659e+04 NaN
1 Alex green NaN 4.348659e+04 NaN
2 Alex red NaN 4.348659e+04 NaN
3 Jake red 6.060661e+05 6.098866e+07 2.948494e+06
4 Jake yellow 6.060661e+05 6.098866e+07 2.948494e+06
5 Nate blue 4.411909e+06 3.576553e+06 3.437040e+05
6 Nate red 4.007490e+06 1.509907e+06 NaN
Answered By - jezrael
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.