Issue
I am trying to use KNN to impute categorical variables in Python.
A typical way to do this is to one hot encode the variables first. However, sklearn's OneHotEncoder() doesn't handle NaNs, so you need to rename them to some placeholder value, which then creates a separate variable.
Small reproducible example:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
# Create a small DataFrame with categorical columns to impute
data0 = pd.DataFrame(columns=["1","2"],data = [["A",np.nan],["B","A"],[np.nan,"A"],["A","B"]])
Original data frame:
data0
     1    2
0    A  NaN
1    B    A
2  NaN    A
3    A    B
Proceed with one hot encoding:
# Replace NaN with a placeholder category so that OneHotEncoder accepts the data
enc_missing = SimpleImputer(strategy="constant",fill_value="missing")
data1 = enc_missing.fit_transform(data0)
# Perform OHE (note: `sparse` was renamed to `sparse_output` in scikit-learn 1.2):
OHE = OneHotEncoder(sparse=False)
data_OHE = OHE.fit_transform(data1)
data_OHE is now one hot encoded:
data_OHE
array([[1., 0., 0., 0., 0., 1.],
[0., 1., 0., 1., 0., 0.],
[0., 0., 1., 1., 0., 0.],
[1., 0., 0., 0., 1., 0.]])
But because of the separate "missing" category, I don't have any NaNs left to impute.
My desired output of one hot encoding:
array([[1, 0, np.nan, np.nan],
[0, 1, 1, 0 ],
[np.nan, np.nan, 1, 0 ],
[1, 0, 0, 1 ]
])
That is, I want to keep the NaNs for later imputation.
Do you know any way to do this?
From my understanding, this has been discussed in the scikit-learn GitHub repo here
and here, i.e. making OneHotEncoder handle this automatically with a handle_missing
argument, but I am unsure of the status of that work.
Solution
Handling of missing values in OneHotEncoder ended up getting merged in PR #17317, but it operates by just treating the missing values as a new category (no option for other treatments, if I understand correctly).
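To illustrate that behavior, here is a minimal sketch that assumes scikit-learn 0.24 or later (the first release with missing-value support in OneHotEncoder). It calls the encoder directly on the question's data without filling the NaNs first; the default sparse output is densified with .toarray() so that the sparse/sparse_output argument is not needed.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same toy data as in the question.
data0 = pd.DataFrame(
    columns=["1", "2"],
    data=[["A", np.nan], ["B", "A"], [np.nan, "A"], ["A", "B"]],
)

# Assumes scikit-learn >= 0.24; older versions raise an error on NaN input.
enc = OneHotEncoder()
encoded = enc.fit_transform(data0).toarray()

# Each column gets an extra category for NaN, so the NaNs themselves
# do not survive into the encoded array.
print(enc.categories_)
print(encoded.shape)
```

The encoded array has three indicator columns per original column (A, B, and the missing category), so it still does not contain NaNs to impute.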
One manual approach is described in this answer. The first step isn't strictly necessary now because of the above PR, but maybe filling with custom text will make it easier to find the column?
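A minimal sketch of such a manual approach, not necessarily identical to the linked answer: fill NaN with a placeholder, one hot encode, then for each original column set the rows flagged by the placeholder indicator back to NaN and drop that indicator column. The MISSING label and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical placeholder label; it must not clash with a real category.
MISSING = "missing"

data0 = pd.DataFrame(
    columns=["1", "2"],
    data=[["A", np.nan], ["B", "A"], [np.nan, "A"], ["A", "B"]],
)

# Step 1: replace NaN with the placeholder.
filled = SimpleImputer(strategy="constant", fill_value=MISSING).fit_transform(data0)

# Step 2: one hot encode; the default output is sparse, so densify it.
enc = OneHotEncoder()
encoded = enc.fit_transform(filled).toarray()

# Step 3: walk over the block of indicator columns belonging to each
# original column, restore NaN where the placeholder was set, and drop
# the placeholder indicator itself.
blocks = []
start = 0
for cats in enc.categories_:
    block = encoded[:, start:start + len(cats)].copy()
    start += len(cats)
    is_placeholder = np.array([c == MISSING for c in cats])
    if is_placeholder.any():
        missing_rows = block[:, is_placeholder].ravel() == 1
        block[missing_rows, :] = np.nan
        block = block[:, ~is_placeholder]
    blocks.append(block)

data_ohe_nan = np.hstack(blocks)
print(data_ohe_nan)
```

For the example data this reproduces the desired array from the question. The result could then be passed to an imputer such as sklearn.impute.KNNImputer, keeping in mind that the imputed values will generally be fractional and would still need to be mapped back to a single category per original column.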
Answered By - Ben Reiniger