Issue
I have been experimenting with RFECV on the Boston dataset.
My understanding so far is that, to prevent data leakage, activities such as this should be performed only on the training data, not the whole dataset.
I performed RFECV on just the training data, and it indicated that all 13 features are optimal. However, when I then ran the same process on the whole dataset, it indicated that only 6 of the features are optimal, which seems more likely.
To illustrate:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
### CONSTANTS
TARGET_COLUMN = 'Price'
TEST_SIZE = 0.1
RANDOM_STATE = 0
### LOAD THE DATA AND ASSIGN TO X and y
data_dict = load_boston()
data = data_dict.data
features = list(data_dict.feature_names)
target = data_dict.target
df = pd.DataFrame(data=data, columns=features)
df[TARGET_COLUMN] = target
X = df[features]
y = df[TARGET_COLUMN]
### PERFORM TRAIN TEST SPLIT
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE,
                                                    random_state=RANDOM_STATE)
#### DETERMINE THE DATA THAT IS PASSED TO RFECV
## Just the Training data
X_input = X_train
y_input = y_train
## All the data
# X_input = X
# y_input = y
### IMPLEMENT RFECV AND GET RESULTS
rfecv = RFECV(estimator=LinearRegression(), step=1, scoring='neg_mean_squared_error')
rfecv.fit(X_input, y_input)
rfecv.transform(X_input)  # returns X_input reduced to the selected features (result unused here)
print(f'Optimal number of features: {rfecv.n_features_}')
imp_feats = X.drop(X.columns[~rfecv.support_], axis=1)
print('Important features:', list(imp_feats.columns))
Running the above will result in:
Optimal number of features: 13
Important features: ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
Now, if I change the code so that RFECV fits on all of the data:
#### DETERMINE THE DATA THAT IS PASSED TO RFECV
## Just the Training data
# X_input = X_train # NOW COMMENTED OUT
# y_input = y_train # NOW COMMENTED OUT
## All the data
X_input = X # NOW UN-COMMENTED
y_input = y # NOW UN-COMMENTED
and run it, I get the following result:
Optimal number of features: 6
Important features: ['CHAS', 'NOX', 'RM', 'DIS', 'PTRATIO', 'LSTAT']
I don't understand why the results are so markedly different (and seemingly more accurate) for the whole dataset as opposed to just the training set.
I have tried making the training set close to the size of the whole dataset by making the test size extremely small (via my TEST_SIZE constant), but I still see this seemingly unlikely difference.
What am I missing?
Solution
It certainly looks like unexpected behavior, especially since, as you say, you can reduce the test size to 10% or even 5% and still find a similar disparity, which is very counter-intuitive. The key to understanding what is going on is to realize that for this particular dataset the values in each column are not randomly distributed across the rows (for example, try running X['CRIM'].plot()). The train_test_split function you are using to split the data has a shuffle parameter that defaults to True, so if you look at the X_train dataset you will see that its index is jumbled up, whereas in X it is sequential. This means that when the cross-validation is performed under the hood by the RFECV class, it gets a biased subset of the data in each split of X, but a more representative/random subset in each split of X_train.
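A quick way to see this, reusing the variables from the script above (the plot call assumes matplotlib is installed):
### The full dataset keeps its original, non-random row order, while
### train_test_split (with the default shuffle=True) jumbles the index
print(X.index[:10].tolist())        # sequential: [0, 1, 2, ..., 9]
print(X_train.index[:10].tolist())  # non-sequential after shuffling
### Plotting a single column shows its values are not randomly
### distributed across the rows (requires matplotlib)
X['CRIM'].plot()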
If you pass shuffle=False to train_test_split you will see that the two results are much closer (or, alternatively and probably better, try shuffling the rows of X before fitting on the full dataset). Both remedies are sketched below.
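For illustration, here is a minimal sketch of both remedies, reusing the names from the question's script (using df.sample to shuffle the rows is one convenient option, not the only one):
### Remedy 1: disable shuffling in the split, so X_train preserves the same
### (biased) row ordering as X and the two RFECV runs become comparable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE,
                                                    shuffle=False)
### Remedy 2 (probably better): shuffle the rows of the full dataset before
### fitting RFECV on it, so every CV fold sees a representative sample
shuffled = df.sample(frac=1, random_state=RANDOM_STATE).reset_index(drop=True)
X_shuffled = shuffled[features]
y_shuffled = shuffled[TARGET_COLUMN]
rfecv = RFECV(estimator=LinearRegression(), step=1,
              scoring='neg_mean_squared_error')
rfecv.fit(X_shuffled, y_shuffled)
print(f'Optimal number of features: {rfecv.n_features_}')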
Answered By - Toby Petty