Issue
My understanding of Recursive Feature Elimination Cross Validation: (sklearn.feature_selection.RFECV
) You provide an algorithm which is trained on the entire dataset and creates a feature importance ranking using attributes coef_
or feature_importances_
. Now with all features included, this algorithm is evaluated by cross validation. Then the feature ranked at the bottom is removed and the model is retrained on the dataset and creates a new ranking, once again assessed by cross validation. This continues until all but one feature remain (or as specified by min_features_to_select
), and the final number of features chosen depends on what yielded the highest CV score. (Source)
Question: The CV score for each number of features is stored in rfecv.cv_results_["mean_test_score"]
, and I've been facing trouble trying to replicate these scores without using scikit's built in method.
This is what I have tried to obtain the score for n-1 features, where n is the total number of features.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.feature_selection import RFECV
alg = DecisionTreeClassifier(random_state = 0)
cv_split = StratifiedKFold(5)
# train is a pandas dataframe, x_var and y_var are both lists containing variable strings
X = train[x_var]
y = np.ravel(train[y_var])
alg.fit(X, y)
lowest_ranked_feature = np.argmin(alg.feature_importances_)
x_var.pop(lowest_ranked_feature)
one_removed_feature = train[x_var]
alg.fit(one_removed_feature, y)
cv_score = cross_validate(alg, one_removed_feature, y, cv=cv_split, scoring="accuracy")
np.mean(cv_score["test_score"])
And this is the inbuilt method that provides a different score:
rfecv = RFECV(
estimator=alg,
step=1,
cv=cv_split,
scoring="accuracy",
)
rfecv.fit(X, y)
rfecv.cv_results_["mean_test_score"][-2]
How do I get the exact scores as calculated in the inbuilt method?
I would also like to mention that I did try this first with all n features, and my method matched with
rfecv.cv_results_["mean_test_score"][-1]
.
Solution
As pointed out by Ben, the reason you are finding different answers is because your understanding of RFECV is fundamentally flawed. Cross validation is not implemented at each step within RFE, but rather RFE is implemented within each fold of cross validation.
Your method currently removes a feature from the data and then performs CV to score it, and essentially performs a different CV to score each subset of features. However, RFECV performs CV only a single time at the start, and then cycles between removing a feature and scoring the model within a singular fold.
Note: From experimentation, I have found that sklearn RFECV computes the score on a singular fold directly by using whichever evaluation metric you specify, and not another layer of cross validation or anything. In your case, you have chosen scoring="accuracy"
. So within a fold, the model will score the test split based on accuracy, and repeat this for each subset of features.
You can implement RFECV from scratch as follows:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
def score(alg, x_train, y_train, x_test, y_test):
"""Calculate the accuracy score of the algorithm."""
alg.fit(x_train, y_train)
y_pred = alg.predict(x_test)
return accuracy_score(y_test, y_pred)
def rfecv_function(alg, x_var, X, y, cv):
"""Perform RFECV and return a dictionary to store an array of test scores
for each cv, where the array contains test scores for each number of
features selected 1,2,...,n (n is total number of features).
"""
dic = {}
# Iterate through folds
for fold_index, (train_index, test_index) in enumerate(cv.split(X, y)):
x_train_fold, y_train_fold = X.iloc[train_index], y.iloc[train_index]
x_test_fold, y_test_fold = X.iloc[test_index], y.iloc[test_index]
features = x_var.copy()
# Array to store test scores for each feature subset
scores_array = np.empty(len(x_var))
# Iterate through the feature subsets
for i in range(len(x_var)):
# Calculate and store the scores in the array
scores = score(alg, x_train_fold[features], y_train_fold,
x_test_fold[features], y_test_fold)
scores_array[-i-1] = scores
# Find and remove the lowest ranked feature
alg.fit(x_train_fold[features], y_train_fold)
lowest_rank = features[np.argmin(alg.feature_importances_)]
features.remove(lowest_rank)
dic[f"split{fold_index}_test_score"] = scores_array
return dic
dtree = DecisionTreeClassifier(random_state = 0)
cv_split = StratifiedKFold(5, shuffle=True, random_state=0)
# Assume train is a pandas dataframe
x_var = ["var1", "var2", "var3"]
y_var = ["target_var"]
rfecv_scores = rfecv_function(dtree, x_var, train[x_var], train[y_var], cv_split)
The rfecv_scores
provide identical values to those computed in the inbuilt class RFECV.cv_results_
Answered By - AvanishM
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.