Friday, October 8, 2021

[FIXED] Encoding data for imputation and then decoding

Labels: numpy, pandas, python, python-2.7, scikit-learn

Issue

I am working in Python and have data with the following structure, mixing categorical and numeric columns:

subject_id  hour_measure  urinecolor  blood pressure
3           1.00          red
            1.15                      high
4           2.00          yellow      low

I want to impute it using hot deck imputation, but I found that I should first encode it as numeric and then impute:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv('path')
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(df)  # this is the call that complains about missing values
enc.transform(df)

When I try to encode, it asks me to fill in the missing values first. So how can I deal with missing values when encoding? Also, once the categorical data is encoded, the imputation will generate values for the missing entries; how can I map them back to the original categories after imputation? Any help with this would be appreciated.

Solution

Basically, you need a scikit-learn pipeline that imputes first and encodes second:

import numpy as np

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

X = np.array(
    [['cat1', 'cat1'],
     ['cat2', np.nan],
     [np.nan, 'cat2']],
    dtype=object
)

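# Impute first so the encoder never sees NaN, then one-hot encode.
encoder = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    # scikit-learn >= 1.2 renames this parameter to sparse_output
    OneHotEncoder(sparse=False)
)
print(encoder.fit_transform(X))
print(encoder[-1].categories_)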

[[1. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 1.]
 [0. 0. 1. 0. 1. 0.]]
[array(['cat1', 'cat2', 'missing'], dtype=object), array(['cat1', 'cat2', 'missing'], dtype=object)]

Here the missing values are represented by np.nan. The imputer first replaces them with the string 'missing'; the encoder then turns each category, including 'missing', into its own column. The missing-value information therefore survives the encoding as one dedicated column per feature.
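The question also asks how to get back from the encoded matrix to the original categories. A minimal sketch of one way to do that, assuming the same toy array as above: call inverse_transform on the fitted encoder step, then turn the 'missing' placeholder back into np.nan. The names encoded, decoded and restored are illustrative only, and the encoder is left at its default sparse output (densified with toarray()) so the snippet does not depend on the sparse / sparse_output parameter name.

```python
import numpy as np

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

X = np.array(
    [['cat1', 'cat1'],
     ['cat2', np.nan],
     [np.nan, 'cat2']],
    dtype=object
)

encoder = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder()  # default sparse output, densified below
)
encoded = encoder.fit_transform(X).toarray()

# Map each one-hot row back to its category labels.
decoded = encoder[-1].inverse_transform(encoded)

# Turn the placeholder string back into a real missing value.
restored = decoded.copy()
restored[restored == "missing"] = np.nan

print(decoded)
print(restored)
```

If a separate imputation step fills in the encoded matrix, the same inverse_transform call can be applied to its result, provided each feature's block of columns still contains a valid one-hot row.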

You probably want to keep this information, since the fact that a value was missing can itself be informative. If you do not need it, you can drop the corresponding columns after encoding.
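As a rough sketch of dropping those columns, assuming the same toy array and the 'missing' placeholder used above: build a boolean mask from the encoder's categories_ and keep only the columns whose category is not 'missing'. The names keep and trimmed are illustrative only.

```python
import numpy as np

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

X = np.array(
    [['cat1', 'cat1'],
     ['cat2', np.nan],
     [np.nan, 'cat2']],
    dtype=object
)

encoder = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder()  # default sparse output, densified below
)
encoded = encoder.fit_transform(X).toarray()

# One boolean per output column: True unless the column encodes 'missing'.
keep = np.concatenate([cats != "missing" for cats in encoder[-1].categories_])

# Drop the 'missing' indicator columns.
trimmed = encoded[:, keep]
print(trimmed)
```

After this step, a row whose columns for a given feature are all zero is one where that feature was originally missing.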

Answered By - glemaitre

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
