Issue
I have imported a dataset containing around 5,000 rows and 85 columns. I am trying to feed the data into an sklearn function for feature analysis, but I am running into an error which suggests that somewhere in the dataframe there is a string, while the function only works with float or int. I previously had the same kind of issue with nan and inf values but have managed to deal with those. Now the problem is locating where the string values are in the dataframe.
I have found solutions for searching a dataframe for an exact or partial string match, but have had no luck finding a solution to this problem, i.e. finding a cell containing any string value.
I have tried df.dtypes, but this reports that all columns are of type int or float. It also reported the same thing when there were nan and inf values present.
Dataset is the testing.csv from https://drive.google.com/drive/folders/1XIlVteHaHFqBXqNApYGb3RoHcBkqpRoR
Code:
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import LabelEncoder
import numpy as np
from matplotlib import pyplot
# ANOVA feature selection for numeric input and categorical output
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
svicpath = Path("SVIC APT 2021/Testing.csv")
ds = pd.read_csv(svicpath)
#Fill any inf with 0
ds.replace([np.inf, -np.inf], 0, inplace=True)
#Fill any nan with 0
ds = ds.fillna(0)
y = ds.iloc[:,-1:]
X = ds.iloc[:, :-1]
#Remove non numeric cols:
X = ds._get_numeric_data()
#Feature Extraction:
#configure to select all features
fs = SelectKBest(score_func=f_classif, k='8')
# learn relationship from training data
fs.fit(X, y)
# what are scores for the features
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()
Error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in
     30 fs = SelectKBest(score_func=f_classif, k='8')
     31 # learn relationship from training data
---> 32 fs.fit(X, y)
     33
     34

~\Anaconda3\lib\site-packages\sklearn\feature_selection\_univariate_selection.py in fit(self, X, y)
    346                 % (self.score_func, type(self.score_func)))
    347
--> 348         self._check_params(X, y)
    349         score_func_ret = self.score_func(X, y)
    350         if isinstance(score_func_ret, (list, tuple)):

~\Anaconda3\lib\site-packages\sklearn\feature_selection\_univariate_selection.py in _check_params(self, X, y)
    509
    510     def _check_params(self, X, y):
--> 511         if not (self.k == "all" or 0 <= self.k <= X.shape[1]):
    512             raise ValueError("k should be >=0, <= n_features = %d; got %r. "
    513                              "Use k='all' to return all features."

TypeError: '<=' not supported between instances of 'int' and 'str'
Solution
"Now the problem is trying to locate where the string values are in the dataframe." and "I have tried df.dtypes but this reports all columns are of type int or float." are two contradictory statements.
You likely only have numbers, NaNs, or Inf.
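The traceback points the same way: the comparison that fails is 0 <= self.k <= X.shape[1] inside _check_params, so the string being compared appears to be the k='8' argument rather than a cell in the dataframe. A minimal sketch of passing k as an integer, assuming synthetic data from make_classification as a stand-in for Testing.csv (the sample and feature counts are made up for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the real dataset (sizes are illustrative only).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# k is compared against integers in _check_params, so pass an int
# (or the literal string "all"); k='8' is a str and fails the comparison.
fs = SelectKBest(score_func=f_classif, k=8)
fs.fit(X, y)

# Keep only the 8 highest-scoring features.
X_selected = fs.transform(X)
print(X_selected.shape)  # (200, 8)
```

If the same TypeError persists with an integer k, the data itself would be the next thing to check.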
You can identify the NaN and Inf cells using numpy.isfinite and numpy.where:
idx, col = np.where(~np.isfinite(df))
list(zip(df.index[idx], df.columns[col]))
# [(0, 'col2'), (1, 'col3')]
If you really do have non-numeric values, coerce them to NaN first with pd.to_numeric:
idx, col = np.where(~np.isfinite(df.apply(pd.to_numeric, errors='coerce')))
list(zip(df.index[idx], df.columns[col]))
Used input:
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [np.nan, 4, 5], 'col3': [6, np.inf, 7]})
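For the question as literally asked (finding any cell that holds a Python str), one possible sketch is to test each cell with isinstance. The frame below is a made-up example, not the asker's data. Columns that df.dtypes reports as int or float cannot contain strings, so this can only find anything in object columns:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two cells hold strings, so col1 and col3 have dtype object.
df = pd.DataFrame({
    'col1': [1, 'a', 3],
    'col2': [4.0, 5.0, 6.0],
    'col3': [7, 8, 'x'],
})

# Boolean mask that is True wherever the cell value is a Python str.
is_str = df.apply(lambda s: s.map(lambda v: isinstance(v, str)))

# Row and column positions of the True cells, mapped back to labels.
idx, col = np.where(is_str.to_numpy())
string_cells = list(zip(df.index[idx], df.columns[col]))
print(string_cells)
```

Unlike the pd.to_numeric approach, this also reports numeric-looking strings such as '8', which to_numeric would convert without complaint.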
Answered By - mozway