Issue
I'm trying to plot the CV score for recursive feature selection for a wine quality data set. I can't work out why my plot ends up splitting out into 5 lines... My expectation is that it will be 1 line showing the change in CV score for a number of selected features.
rfe.grid_scores_ comes out as a list of lists... I've no idea why its doing this. All examples I have seen plotting grid_scores like this show the plot as one line without any indexing of the list...
dataset_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
df = pd.read_csv(dataset_url)
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
# define dataset
x = df.iloc[:, 0:-1]
y = df.quality
# create pipeline
rfe = RFECV(estimator=DecisionTreeClassifier())
model = RandomForestRegressor()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
pipeline.fit(x,y)
# evaluate model
cv = RepeatedKFold(n_splits=10)
n_scores = cross_val_score(pipeline, x, y, cv=cv, n_jobs=1, error_score='raise', scoring = 'neg_mean_squared_error')
plt.figure(figsize=(12,6))
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(rfe.grid_scores_) + 1), rfe.grid_scores_)
plt.show()
Solution
You're plotting five lines because rfe.grid_scores_
returns an array of (number of features, scores per fold) or (11,5). By default, RFECV
uses a stratified, 5-fold split. Each line in your plot shows the scores across the number of features for each of the 5 folds.
If you want to use the specified folds and method, you'll need to pass those to the RFECV
.
dataset_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
df = pd.read_csv(dataset_url, sep = ';')
# define dataset
x = df.iloc[:, 0:-1]
y = df.quality
### Define the cross-validation strategy here so that it is the same in all parts of your pipeline
# create pipeline
cv = RepeatedKFold(n_splits=10) # 10-folds specified by you, 10 repeats (default)
rfe = RFECV(estimator=DecisionTreeClassifier(), cv=cv)
The resulting plot will now plot 100 lines over 11 features, (10 folds times 10 repeats).
Notice also that these are the grid scores, not your cross-validation scores. You'll need to plot n_scores
to view the actual cross-validation scores.
Answered By - m13op22
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.