Issue
One of my projects uses the scikit-learn imputer to handle NaN values. However, it seems to remove rows that are entirely made up of NaN, as the following snippet shows:
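import math
import numpy as np
from sklearn.impute import SimpleImputer

tmp = [[math.nan, 3.0],[math.nan, 5.0],[math.nan, math.nan]]
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# np.float64 replaces the np.float_ alias, which was removed in NumPy 2.0
imp_tmp = imp.fit_transform(np.asarray(tmp, dtype=np.float64))
print(np.asarray(tmp, dtype=np.float64))
print(np.asarray(imp_tmp, dtype=np.float64))
# compares the length of the first row, i.e. the number of columns
assert len(np.asarray(tmp, dtype=np.float64)[0]) == len(np.asarray(imp_tmp, dtype=np.float64)[0])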
In particular, the assertion fails. Does anyone know whether this behavior is documented and how it can be prevented? I could not find anything about the imputer removing NaN rows in the SimpleImputer documentation.
Solution
As explained in the documentation, columns containing only missing values are discarded unless strategy='constant':
Columns which only contained missing values at fit are discarded upon transform if strategy is not “constant”.
This means that in your case it is the first column that is discarded, not a row. You are left with only the second column, where the missing value in the last row is replaced by the mean of the non-missing values in the first two rows, (3.0 + 5.0) / 2 = 4.0:
import math
import numpy as np
from sklearn.impute import SimpleImputer
# np.float64 is used below because the np.float_ alias was removed in NumPy 2.0
tmp = [[math.nan, 3.0],[math.nan, 5.0],[math.nan, math.nan]]
print(np.asarray(tmp, dtype=np.float64))
# [[nan 3.]
# [nan 5.]
# [nan nan]]
print(np.asarray(tmp, dtype=np.float64).shape)
# (3, 2)
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_tmp = imp.fit_transform(np.asarray(tmp, dtype=np.float64))
print(np.asarray(imp_tmp, dtype=np.float64))
# [[3.]
# [5.]
# [4.]]
print(np.asarray(imp_tmp, dtype=np.float64).shape)
# (3, 1)
Answered By - Flavia Giammarino