Issue
I have a data frame that has this kind of data in it:
ID category
ID2 category
Sex category
Cysts category
Death category
Years int64
Group category
An example of the data:
0 11090 1 0 0 0 46 1
1 10336 5 0 0 1 60 2
2 8117 8 1 0 1 39 1
3 10262 9 0 0 1 37 5
4 11084 10 0 0 1 47 1
There are 15 missing entries in column 'Cysts' that I want to impute.
When I write this code for SimpleImputer:
import pandas as pd
import scipy.sparse as sp
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
df = pd.read_csv('filtered.txt',sep='\t',dtype='category').iloc[:,:7]
print(df.dtypes)
imp = SimpleImputer(missing_values='-1',strategy='most_frequent')
df = pd.DataFrame(imp.fit_transform(df))
print(df)
it prints an output as expected:
.. ... ... .. .. .. .. ..
209 10373 164 1 1 0 44 1
210 11267 171 1 1 0 81 6
211 11101 175 1 1 1 65 1
212 11232 176 1 1 0 28 1
213 11236 176 1 1 0 31 1
(i.e. the -1s that were originally in this column as missing data are replaced with 1s in column 4).
I then tried the same thing with IterativeImputer:
import pandas as pd
import scipy.sparse as sp
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
df = pd.read_csv('filtered.txt',sep='\t',dtype='category').iloc[:,:7]
print(df.dtypes)
imp = IterativeImputer(missing_values='-1',initial_strategy='most_frequent')
df = pd.DataFrame(imp.fit_transform(df))
print(df)
But I get the error:
ValueError: could not convert string to float: '8127/10206'
That value is one of the values in the ID2 column. I'm aware it's not a float; it's not meant to be.
Can the iterative imputer only be used with numeric columns? I thought by having a 'most_frequent' initial_strategy parameter that categorical data could be used, but maybe I'm wrong?
Solution
IterativeImputer uses an Estimator object (Bayesian Ridge regression by default) to iteratively make better predictions for each column's missing values using the values of the other columns as features. It does not support non-numeric data. You may be able to get acceptable results if you cast the data to numeric, iteratively impute, then re-discretize it.
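A minimal sketch of the cast, impute, re-discretize idea follows. The small frame is a made-up stand-in for the question's data, not the real file. It assumes -1 marks a missing 'Cysts' value and that the identifier columns (ID, ID2) are left out of the imputation, since values like '8127/10206' cannot be cast to float.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical stand-in data; -1 marks a missing 'Cysts' value.
df = pd.DataFrame({
    "Sex":   [0, 0, 1, 0, 0, 1, 1, 1],
    "Cysts": [0, 0, 0, -1, 0, 1, 1, -1],
    "Death": [0, 1, 1, 1, 1, 0, 0, 1],
    "Years": [46, 60, 39, 37, 47, 44, 81, 65],
    "Group": [1, 2, 1, 5, 1, 1, 6, 1],
})

# Cast everything to float and turn the -1 markers into NaN.
X = df.astype(float)
X["Cysts"] = X["Cysts"].mask(X["Cysts"] == -1)

# Default estimator is BayesianRidge, so the imputed values are continuous.
imp = IterativeImputer(initial_strategy="most_frequent", random_state=0)
imputed = pd.DataFrame(imp.fit_transform(X), columns=X.columns, index=X.index)

# Re-discretize: round to the nearest class and keep it inside {0, 1}.
imputed["Cysts"] = imputed["Cysts"].round().clip(0, 1).astype(int)
print(imputed)
```

Rounding a regression output is a crude way to get back to categories. It is only reasonable for binary or ordered codes; for an unordered column such as Group, the rounded value has no real meaning.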
I have not tested this, but you may be able to one-hot encode strictly categorical data and iteratively impute it with a Classifier.
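One possible shape for the classifier variant is sketched below, again on made-up data. The choice of RandomForestClassifier, the use of pd.get_dummies for the one-hot step, and skip_complete=True (so that a classifier is only fitted for the column that actually has gaps) are all assumptions for illustration, not something the answer prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical stand-in data; missing 'Cysts' values are already NaN here.
df2 = pd.DataFrame({
    "Sex":   [0, 0, 1, 0, 0, 1, 1, 1],
    "Cysts": [0, 0, 0, np.nan, 0, 1, 1, np.nan],
    "Death": [0, 1, 1, 1, 1, 0, 0, 1],
    "Years": [46, 60, 39, 37, 47, 44, 81, 65],
    "Group": ["1", "2", "1", "5", "1", "1", "6", "1"],
})

# One-hot encode the multi-level categorical column so it becomes numeric.
features = pd.get_dummies(df2, columns=["Group"], dtype=float)

clf_imp = IterativeImputer(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    initial_strategy="most_frequent",
    skip_complete=True,  # do not fit a classifier for fully observed columns
    max_iter=5,
    random_state=0,
)
filled = pd.DataFrame(
    clf_imp.fit_transform(features), columns=features.columns, index=features.index
)

# The classifier predicts existing class labels, so no rounding is needed.
filled["Cysts"] = filled["Cysts"].astype(int)
print(filled[["Sex", "Cysts", "Death", "Years"]])
```

If a one-hot encoded column itself had missing values, each dummy column would be imputed independently and a row could end up with zero or several categories set, so that case would need extra handling.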
Answered By - eschibli