Issue
I have big data and I need to do imputation to fill out missing values. but I need to do imputation using groupby to fill missing values based on another column's value. let's say I have this table
data = {'name': ['Alex', 'Ben', 'Marry','Alex', 'Ben', 'Marry'],
'job': ['teacher', 'doctor', 'engineer','teacher', 'doctor', 'engineer'],
'age': [27, 32, 78,27, 32, 78],
'weight': [160, 209, 130,164, 206, 132],
'date': ['6-12-2022', '6-12-2022', '6-12-2022','6-13-2022', '6-13-2022', '6-13-2022']
}
df = pd.DataFrame(data) df
I add data for the next day, but as null values, and I end up with this table:
|name |job |age|weight |date
|---|-------|-----------|---|-------|--------
|0 |Alex |teacher |27 |160 |6-12-2022
|1 |Ben |doctor |32 |209 |6-12-2022
|2 |Marry |engineer |78 |130 |6-12-2022
|3 |Alex |teacher |27 |164 |6-13-2022
|4 |Ben |doctor |32 |206 |6-13-2022
|5 |Marry |engineer |78 |132 |6-13-2022
|6 |Alex |teacher |NaN|NaN |6-14-2022
|7 |Ben |doctor |NaN|NaN |6-14-2022
|8 |Marry |engineer |NaN|NaN |6-14-2022
now if I want to fill out values for "age" and "weight" columns using KNNImputer or any other imputer, how can I do that while I groupby "name"? As I mentioned the dataframe is a sample and I am dealing with big data Appreciate any helps
Thanks
Solution
@mnm. The problem you posted is not straightforward to solve. The main starting problem is the fact that all numerical values are NaN and all remaining values are strings. The latter is rather inconvenient for KNN because there is no intuitive distance between strings.
I will (1) use a one-hot encoding for the job variable, and (2) label the consecutive days by 1, 2, and 3. Overall, this results in the following training and test data:
TRAINING
name teacher doctor engineer age weight date
0 Alex 1 0 0 27 160 1
1 Ben 0 1 0 32 209 1
2 Marry 0 0 1 78 130 1
3 Alex 1 0 0 27 164 2
4 Ben 0 1 0 32 206 2
5 Marry 0 0 1 78 132 2
TEST
name teacher doctor engineer age weight date
0 Alex 1 0 0 NaN NaN 3
1 Ben 0 1 0 NaN NaN 3
2 Marry 0 0 1 NaN NaN 3
Clearly, this is not the only way to recode the job
and date
variable into numerical values. For example, you can easily ensure that consecutive days are only 1/365 apart. This essentially gives less weight to date differences when running the nearest neighbors algorithm.
The code below applies KNN to insert a single missing value into the table. To impute all missing observations:
- Transform the code underneath "NEAREST NEIGHBORS" into a function.
- Loop through the data the test data, identify a NaN, and impute using this function.
I hope this helps...
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
############
### DATA ###
############
TrainingData = { 'name': ['Alex', 'Ben', 'Marry','Alex', 'Ben', 'Marry'],
'teacher': [1,0,0,1,0,0],
'doctor': [0,1,0,0,1,0],
'engineer': [0,0,1,0,0,1],
'age': [27, 32, 78,27, 32, 78],
'weight': [160, 209, 130,164, 206, 132],
'date': [1,1,1,2,2,2]}
TestData = {'name': ['Alex', 'Ben', 'Marry'],
'teacher': [1,0,0],
'doctor': [0,1,0],
'engineer': [0,0,1],
'age': [np.NaN,np.NaN,np.NaN],
'weight': [np.NaN,np.NaN,np.NaN],
'data': [3,3,3]}
# Convert to pandas dataframe
dfTraining = pd.DataFrame(TrainingData)
dfTest = pd.DataFrame(TestData)
# Print
print(dfTraining)
print(dfTest)
#########################
### NEAREST NEIGHBORS ###
#########################
Variable2Predict = 'weight'
Person2Predict = 'Marry'
#### Step 1: Determine training data
# Select rows of beloning to Person2Predict
dfSubSelectedByName = dfTraining.loc[dfTraining['name'] == Person2Predict]
# Assign dependent variable as the variable to predict
Y = dfSubSelectedByName[Variable2Predict].to_numpy()
# Nearest neighbor computation should be based on all columns in the data except 'name' and the variable to predict
X = dfSubSelectedByName[dfSubSelectedByName.columns[~dfSubSelectedByName.columns.isin(['name', Variable2Predict])]].to_numpy()
print(X)
### Step 2: Nearest neighbor prediction
K = 1
# Read row from pandas dataframe based on name
Row2Predict = dfTest.loc[dfTest['name'] == Person2Predict]
# Select all variable except for name and dependent variable
Xpred = Row2Predict[Row2Predict.columns[~Row2Predict.columns.isin(['name', Variable2Predict])]].to_numpy()
# Step 3: K-NN based on the observed variables
PresentVariables = ~np.isnan(Xpred)
NumberOfPresentVariables = np.sum(PresentVariables)
NearestNeighbor = KNeighborsRegressor(n_neighbors=K).fit(X[:, PresentVariables[0]], Y)
Yhat = NearestNeighbor.predict( np.reshape( Xpred[PresentVariables], (1,NumberOfPresentVariables) ) )
print(Yhat)
Answered By - Hanno Reuvers
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.