Friday, April 8, 2022

[FIXED] ValueError could not convert string to float: is IterativeImputer in sklearn only for numerical features?

April 08, 2022 python, scikit-learn

Issue

I have a data frame that has this kind of data in it:

ID      category
ID2     category
Sex     category
Cysts   category
Death   category
Years   int64
Group   category

An example of the data:

0    11090    1  0  0  0  46  1
1    10336    5  0  0  1  60  2
2     8117    8  1  0  1  39  1
3    10262    9  0  0  1  37  5
4    11084   10  0  0  1  47  1

There are 15 missing entries in column 'Cysts' that I want to impute.

When I write this code for SimpleImputer:

import pandas as pd
import scipy.sparse as sp
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np

df = pd.read_csv('filtered.txt',sep='\t',dtype='category').iloc[:,:7]
print(df.dtypes)
imp = SimpleImputer(missing_values='-1',strategy='most_frequent')
df = pd.DataFrame(imp.fit_transform(df))
print(df)

it prints an output as expected:

..     ...  ... .. .. ..  .. ..
209  10373  164  1  1  0  44  1
210  11267  171  1  1  0  81  6
211  11101  175  1  1  1  65  1
212  11232  176  1  1  0  28  1
213  11236  176  1  1  0  31  1

(i.e., the -1s that originally marked missing data in this column are replaced with 1s in column 4).

When I use IterativeImputer instead:

import pandas as pd
import scipy.sparse as sp
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np

df = pd.read_csv('filtered.txt',sep='\t',dtype='category').iloc[:,:7]
print(df.dtypes)
imp = IterativeImputer(missing_values='-1',initial_strategy='most_frequent')
df = pd.DataFrame(imp.fit_transform(df))
print(df)

But I get the error:

ValueError: could not convert string to float: '8127/10206'

That value is one of the values in the ID2 column. I'm aware it's not a float; it's not meant to be.

Can the iterative imputer only be used with numeric columns? I thought that, because initial_strategy accepts 'most_frequent', categorical data could be used, but maybe I'm wrong?

Solution

IterativeImputer uses an estimator object (BayesianRidge regression by default) to iteratively refine its predictions for each column's missing values, using the values of the other columns as features. It does not support non-numeric data. The initial_strategy parameter only controls how missing values are filled in before the first iteration; the modeling steps that follow still need numeric input. You may be able to get acceptable results if you cast the data to numeric, iteratively impute, then re-discretize it.
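A minimal sketch of the cast, impute, re-discretize idea follows. The toy values are made up for illustration and are not the asker's data. It also assumes that the identifier columns (ID, ID2) are dropped before imputing, since values such as '8127/10206' cannot be cast to a number, and that treating Group as an ordinary number is acceptable.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data shaped like the question's frame (made-up values, ID columns dropped).
df = pd.DataFrame({
    "Sex":   [0, 0, 1, 0, 0, 1, 1, 0],
    "Cysts": [0, 0, 0, -1, 0, 1, 1, -1],   # -1 marks a missing entry
    "Death": [0, 1, 1, 1, 1, 0, 0, 1],
    "Years": [46, 60, 39, 37, 47, 44, 81, 65],
    "Group": [1, 2, 1, 5, 1, 1, 6, 1],
})

# 1. Cast to numeric and mark the -1 placeholders as NaN.
X = df.astype(float).replace(-1, np.nan)

# 2. Iteratively impute with the default BayesianRidge estimator.
imp = IterativeImputer(initial_strategy="most_frequent", random_state=0)
X_imp = pd.DataFrame(imp.fit_transform(X), columns=X.columns)

# 3. Re-discretize: clip to the observed range and round to the nearest level.
observed = X["Cysts"].dropna()
X_imp["Cysts"] = (
    X_imp["Cysts"].clip(observed.min(), observed.max()).round().astype(int)
)

print(X_imp)
```

Rounding only makes sense when the category codes have a meaningful order (here 0/1). For unordered categories with more than two levels, the rounded value may land on a level that has no real interpretation.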

I have not tested this, but you may be able to one-hot encode strictly categorical data and iteratively impute it with a Classifier.
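As a rough illustration of the classifier idea, here is a simplified single-pass stand-in, not IterativeImputer itself. Since only 'Cysts' has missing entries, one model fitted on the complete rows may be enough. The toy data, the choice of LogisticRegression, and the dropped ID columns are all assumptions made for the sketch.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (made-up values); categorical columns are strings, as with dtype='category'.
df = pd.DataFrame({
    "Sex":   ["0", "0", "1", "0", "0", "1", "1", "0"],
    "Cysts": ["0", "0", "0", "-1", "0", "1", "1", "-1"],  # "-1" marks missing
    "Death": ["0", "1", "1", "1", "1", "0", "0", "1"],
    "Years": [46, 60, 39, 37, 47, 44, 81, 65],
    "Group": ["1", "2", "1", "5", "1", "1", "6", "1"],
})

missing = df["Cysts"] == "-1"

# One-hot encode the categorical predictors; Years stays numeric.
features = pd.get_dummies(
    df.drop(columns="Cysts"), columns=["Sex", "Death", "Group"], dtype=float
)

# Fit a classifier on the rows where Cysts is known ...
clf = LogisticRegression(max_iter=1000)
clf.fit(features[~missing], df.loc[~missing, "Cysts"])

# ... and fill the missing rows with its predicted class labels.
df.loc[missing, "Cysts"] = clf.predict(features[missing])

print(df)
```

Because a classifier predicts one of the observed labels directly, no rounding or re-discretizing step is needed afterwards.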

Answered By - eschibli

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
