Issue
I have been experimenting with RFECV on the Boston dataset.
My understanding so far is that, to prevent data leakage, activities such as this should be performed only on the training data, not the whole dataset.
I performed RFECV on just the training data, and it indicated that all 13 features are optimal. However, when I then ran the same process on the whole dataset, it indicated that only 6 of the features are optimal, which seems more likely.
To illustrate:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
### CONSTANTS
TARGET_COLUMN = 'Price'
TEST_SIZE = 0.1
RANDOM_STATE = 0
### LOAD THE DATA AND ASSIGN TO X and y
data_dict = load_boston()
data = data_dict.data
features = list(data_dict.feature_names)
target = data_dict.target
df = pd.DataFrame(data=data, columns=features)
df[TARGET_COLUMN] = target
X = df[features]
y = df[TARGET_COLUMN]
### PERFORM TRAIN TEST SPLIT
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE,
                                                    random_state=RANDOM_STATE)
#### DETERMINE THE DATA THAT IS PASSED TO RFECV
## Just the Training data
X_input = X_train
y_input = y_train
## All the data
# X_input = X
# y_input = y
### IMPLEMENT RFECV AND GET RESULTS
rfecv = RFECV(estimator=LinearRegression(), step=1, scoring='neg_mean_squared_error')
rfecv.fit(X_input, y_input)
rfecv.transform(X_input)  # returns X_input reduced to the selected features (result unused here)
print(f'Optimal number of features: {rfecv.n_features_}')
imp_feats = X.drop(X.columns[~rfecv.support_], axis=1)
print('Important features:', list(imp_feats.columns))
Running the above will result in:
Optimal number of features: 13
Important features: ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
Now, if I change the code so that RFECV fits on all of the data:
#### DETERMINE THE DATA THAT IS PASSED TO RFECV
## Just the Training data
# X_input = X_train # NOW COMMENTED OUT
# y_input = y_train # NOW COMMENTED OUT
## All the data
X_input = X # NOW UN-COMMENTED
y_input = y # NOW UN-COMMENTED
and run it, I get the following result:
Optimal number of features: 6
Important features: ['CHAS', 'NOX', 'RM', 'DIS', 'PTRATIO', 'LSTAT']
I don't understand why the results are so markedly different (and seemingly more accurate) for the whole dataset as opposed to just the training set.
I have tried making the training set close to the size of the whole dataset by making the test size extremely small (via my TEST_SIZE constant), but I still see this seemingly unlikely difference.
What am I missing?
Solution
It certainly looks like unexpected behavior, especially since, as you say, you can reduce the test size to 10% or even 5% and still find a similar disparity, which is very counter-intuitive. The key to understanding what is going on is to realize that for this particular dataset the values in each column are not randomly distributed across the rows (for example, try running X['CRIM'].plot()). The train_test_split function you are using to split the data has a shuffle parameter that defaults to True, so if you look at the X_train dataset you will see that its index is jumbled up, whereas in X it is sequential. This means that when the cross-validation is performed under the hood by the RFECV class, it gets a biased subset of the data in each split of X, but a more representative/random subset in each split of X_train.
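A quick way to see this, reusing the variables from the script above (the plot call assumes matplotlib is installed):
### The full dataset keeps its original, non-random row order, while
### train_test_split (with the default shuffle=True) jumbles the index
print(X.index[:10].tolist())        # sequential: [0, 1, 2, ..., 9]
print(X_train.index[:10].tolist())  # non-sequential after shuffling
### Plotting a single column shows its values are not randomly
### distributed across the rows (requires matplotlib)
X['CRIM'].plot()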
If you pass shuffle=False to train_test_split you will see that the two results are much closer (or, alternatively and probably better, try shuffling the rows of X before fitting on the full dataset). Both remedies are sketched below.
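For illustration, here is a minimal sketch of both remedies, reusing the names from the question's script (using df.sample to shuffle the rows is one convenient option, not the only one):
### Remedy 1: disable shuffling in the split, so X_train preserves the same
### (biased) row ordering as X and the two RFECV runs become comparable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE,
                                                    shuffle=False)
### Remedy 2 (probably better): shuffle the rows of the full dataset before
### fitting RFECV on it, so every CV fold sees a representative sample
shuffled = df.sample(frac=1, random_state=RANDOM_STATE).reset_index(drop=True)
X_shuffled = shuffled[features]
y_shuffled = shuffled[TARGET_COLUMN]
rfecv = RFECV(estimator=LinearRegression(), step=1,
              scoring='neg_mean_squared_error')
rfecv.fit(X_shuffled, y_shuffled)
print(f'Optimal number of features: {rfecv.n_features_}')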
Answered By - Toby Petty