Issue
I am working in Python and have data with the following structure, mixing categorical and numeric columns:
subject_id  hour_measure  urinecolor  blood pressure
3           1.00          red
            1.15                      high
4           2.00          yellow      low
I want to impute it using hot deck imputation, but I found that I should first encode it as numeric and then impute:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
df = pd.read_csv('path')
enc.fit(df)
enc.transform(df)
When I try to encode, it asks me to fill in the missing values first. So how can I deal with missing values when encoding? Also, once I encode the categorical data, the imputation will generate values for the missing entries; how can I map them back to the original categories after imputation? Any help with this would be appreciated.
Solution
Basically you need to use a scikit-learn pipeline:
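import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy categorical data with missing entries encoded as np.nan
X = np.array(
    [['cat1', 'cat1'],
     ['cat2', np.nan],
     [np.nan, 'cat2']],
    dtype=object
)

# Step 1: replace np.nan with the string "missing"; step 2: one-hot encode
encoder = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    # scikit-learn >= 1.2; older versions use sparse=False instead
    OneHotEncoder(sparse_output=False)
)
print(encoder.fit_transform(X))
print(encoder[-1].categories_)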
[[1. 0. 0. 1. 0. 0.]
[0. 1. 0. 0. 0. 1.]
[0. 0. 1. 0. 1. 0.]]
[array(['cat1', 'cat2', 'missing'], dtype=object), array(['cat1', 'cat2', 'missing'], dtype=object)]
Here the missing values are represented by np.nan values. They are first replaced by the string 'missing'. Then each category becomes a column, so the "missing" information is represented by a column of its own after the encoding.
You probably don't want to remove this information from your data, but if you do, you can drop the corresponding columns.
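As a rough sketch of both options (not part of the original answer), the snippet below drops the "missing" indicator columns by looking them up in categories_, and also maps the encoded matrix back to the original labels with OneHotEncoder.inverse_transform, which touches on the "reverse it" part of the question. It reuses the same toy array; the variable names (keep, encoded_no_missing, restored) are only illustrative, and the encoder is left at its default sparse output and densified with .toarray() so the code does not depend on the sparse / sparse_output parameter name.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

X = np.array(
    [['cat1', 'cat1'],
     ['cat2', np.nan],
     [np.nan, 'cat2']],
    dtype=object
)

# Same two steps as the pipeline, kept separate so each piece can be reused
imputer = SimpleImputer(strategy="constant", fill_value="missing")
onehot = OneHotEncoder()
encoded = onehot.fit_transform(imputer.fit_transform(X)).toarray()

# Option 1: drop the indicator columns that correspond to the "missing" category.
# categories_ holds one array per input column, in output-column order.
keep = np.concatenate([cats != "missing" for cats in onehot.categories_])
encoded_no_missing = encoded[:, keep]
print(encoded_no_missing)

# Option 2: go back from the one-hot matrix to the original labels,
# then turn the "missing" placeholder back into np.nan.
restored = onehot.inverse_transform(encoded).astype(object)
restored[restored == "missing"] = np.nan
print(restored)
```

Note that inverse_transform expects the full encoded matrix, so if the values are changed by a later imputation step they would need to be valid one-hot rows again before decoding.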
Answered By - glemaitre