Saturday, May 14, 2022

[FIXED] Sklearn error: None of [Int64Index([2, 3], dtype='int64')] are in the [columns]

python, scikit-learn

Issue

Could someone explain why this code:

from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
import numpy as np

#df = pd.read_csv('missing_data.csv',sep=',')

df = pd.DataFrame(np.array([[1, 2, 3,4,5,6,7,8,9,1],
                            [4, 5, 6,3,4,5,7,5,4,1],
                            [7, 8, 9,6,2,3,6,5,4,1],
                            [7, 8, 9,6,1,3,2,2,4,0],
                            [7, 8, 9,6,5,6,6,5,4,0]]),
                            columns=['a', 'b', 'c','d','e','f','g','h','i','j'])

X_train = df.iloc[:,:-1]
y_train = df.iloc[:,-1]


clf=SVC(kernel='linear')
kfold = StratifiedKFold(n_splits=2,random_state=42,shuffle=True)
for train_index,test_index in kfold.split(X_train,y_train):
    x_train_fold,x_test_fold = X_train[train_index],X_train[test_index]
    y_train_fold,y_test_fold = y_train[train_index],y_train[test_index]
    clf.fit(x_train_fold,y_train_fold)

throws this error:

Traceback (most recent call last):
  File "test_traintest.py", line 23, in <module>
    x_train_fold,x_test_fold = X_train[train_index],X_train[test_index]
  File "/Users/slowat/anaconda/envs/nlp_course/lib/python3.7/site-packages/pandas/core/frame.py", line 3030, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "/Users/slowat/anaconda/envs/nlp_course/lib/python3.7/site-packages/pandas/core/indexing.py", line 1266, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
  File "/Users/slowat/anaconda/envs/nlp_course/lib/python3.7/site-packages/pandas/core/indexing.py", line 1308, in _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Int64Index([2, 3], dtype='int64')] are in the [columns]"

I saw this answer, but the length of my columns is equal.

Solution

StratifiedKFold.split() yields arrays of integer row positions for the train and test sets. With a DataFrame, these positions have to go through .iloc, like this:

X_train.iloc[train_index]

With your syntax, X_train[train_index], pandas looks the integers up among the column labels, which is why the KeyError mentions [columns]. Change your code to:

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

df = pd.DataFrame(np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 1],
                            [4, 5, 6, 3, 4, 5, 7, 5, 4, 1],
                            [7, 8, 9, 6, 2, 3, 6, 5, 4, 1],
                            [7, 8, 9, 6, 1, 3, 2, 2, 4, 0],
                            [7, 8, 9, 6, 5, 6, 6, 5, 4, 0]]),
                  columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])

# Features are all columns but the last; the label is the last column.
X_train = df.iloc[:, :-1]
y_train = df.iloc[:, -1]

clf = SVC(kernel='linear')
kfold = StratifiedKFold(n_splits=2, random_state=42, shuffle=True)
for train_index, test_index in kfold.split(X_train, y_train):
    # split() yields row positions, so select rows with .iloc
    x_train_fold, x_test_fold = X_train.iloc[train_index], X_train.iloc[test_index]
    y_train_fold, y_test_fold = y_train.iloc[train_index], y_train.iloc[test_index]
    clf.fit(x_train_fold, y_train_fold)
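
The difference between the two indexing styles can be seen in isolation. The following is a minimal sketch on a hypothetical toy frame (not the data from the question), where `positions` stands in for one of the index arrays that split() yields. The exact wording of the KeyError message may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only for illustration.
df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1, 2, 3, 4]})
positions = np.array([2, 3])  # stands in for train_index / test_index

# Plain brackets with a list-like key look the values up among the
# column labels ("a", "b"), so the integers 2 and 3 are not found.
try:
    df[positions]
    bracket_error = None
except KeyError as exc:
    bracket_error = exc

# .iloc interprets the same array as row positions.
rows = df.iloc[positions]
print(type(bracket_error).__name__)
print(rows)
```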

Note that we use .iloc and not .loc. This is because .iloc selects by integer position, which is what split() gives us, while .loc selects by index label. In your case it doesn't matter, since the default pandas index (0, 1, 2, ...) coincides with the row positions, but in other projects that may not be the case, so stick with .iloc.
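
As a small illustration of that last point, here is a sketch with a hypothetical frame whose index labels no longer match the row positions (for example, after rows were filtered out of a larger frame). The frame and the `positions` list are made up for the example.

```python
import pandas as pd

# Hypothetical frame: index labels start at 100, positions start at 0.
df = pd.DataFrame({"a": [10, 20, 30, 40]}, index=[100, 101, 102, 103])
positions = [2, 3]  # row positions, as split() would yield them

# .iloc selects the rows at positions 2 and 3.
by_position = df.iloc[positions]

# .loc looks for index labels 2 and 3, which do not exist here.
try:
    df.loc[positions]
    loc_error = None
except KeyError as exc:
    loc_error = exc

print(by_position)
print(type(loc_error).__name__)
```

If the labels did happen to include 2 and 3 but at other positions, .loc would not raise at all and would silently return different rows, which is harder to notice than an error.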

Alternatively, when you extract X_train and y_train you can convert them to numpy arrays:

X_train = df.iloc[:,:-1].to_numpy()
y_train = df.iloc[:,-1].to_numpy()

Your original loop then works unchanged, because indexing a NumPy array with an array of integers selects rows by position.
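
A minimal sketch of that behaviour, using small made-up arrays and a hypothetical `test_index` in place of the one produced by split():

```python
import numpy as np

# Made-up feature matrix and labels, for illustration only.
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 1, 0, 0])
test_index = np.array([2, 3])  # hypothetical fold indices

# Integer-array indexing on the first axis picks whole rows by position.
X_test_fold = X[test_index]
y_test_fold = y[test_index]
print(X_test_fold)
print(y_test_fold)
```

The trade-off is that the column names are lost once the data is a plain array, so keeping the DataFrame and using .iloc may be preferable when the names matter later.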

Answered By - user2246849

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
