Issue
I'm trying to find the best set of hyperparameters for my Gradient Boosting Regressor with Grid Search CV, but I'm having difficulty getting the performance of the best model.
My code is as follows; this function is expected to return an optimized model.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

def parameter_tuning_Gradient_Boost(X, y):
    model = GradientBoostingRegressor()
    param_grid = {
        'learning_rate': [0.005, 0.01, 0.02, 0.05, 0.1],
        'subsample': [1.0, 0.8, 0.6],
        'n_estimators': [100, 200, 500, 1000],
        'max_depth': [2, 4, 6, 8, 10],
    }
    grid_search = GridSearchCV(model,
                               param_grid,
                               cv=5,
                               n_jobs=8,
                               verbose=0)
    grid_search.fit(X=X, y=y)
    print('Best Parameters by Searching: %s' % grid_search.best_params_)
    # Rebuild a fresh (unfitted) model using the best hyperparameters found
    best_parameters = grid_search.best_estimator_.get_params()
    model = GradientBoostingRegressor(learning_rate=best_parameters['learning_rate'],
                                      subsample=best_parameters['subsample'],
                                      n_estimators=best_parameters['n_estimators'],
                                      max_depth=best_parameters['max_depth'])
    return model
In general, I have the following questions:
- Do I have to use the train_test_split function to split X and y, and then feed X_train and y_train to grid_search.fit? Some say GridSearchCV will automatically split the data into train and test sets if you set cv = 5, but I have seen online tutorials do something like this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
grid_search.fit(X_train, y_train)
- What is the scoring metric for a regressor in GridSearchCV? After fitting GridSearchCV, I ran the following commands and got very different scores. I am wondering what the correct way is to get a model's cross-validation performance for a regressor in scikit-learn's GridSearchCV.
print("Best Score:", grid_search.score(X, y))
print("Best Score: %.3f" % grid_search.best_score_)
In short: I applied GridSearchCV to tune a regressor and to get its cross-validation performance. I want to know what the default evaluation metric is, and whether I have to split the data into train and test sets myself when I set the cv parameter in GridSearchCV.
Solution
Yes, GridSearchCV will split the data into 5 train/test splits and use those splits to find the optimal hyperparameters. However, it is also good practice to set aside a completely unseen split of the data that you score the model(s) on once you are completely done with training. Take a look at this article to read more on this. Remember: after evaluating the model on the unseen data set, you are not "allowed" to improve your model any further.
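For illustration, a minimal sketch of that workflow (the toy data from make_regression, the 0.2 test fraction, and the small grid are just example choices, not part of your original code):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)  # toy data

# Hold out a final test set that GridSearchCV never sees
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

grid_search = GridSearchCV(GradientBoostingRegressor(),
                           param_grid={'max_depth': [2, 4], 'n_estimators': [100, 200]},
                           cv=5)
grid_search.fit(X_train, y_train)   # cross-validation happens inside the training set

# One final, honest evaluation on data that played no part in tuning
print(grid_search.best_estimator_.score(X_test, y_test))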
The scoring metric for GridSearchCV can be user defined via its scoring parameter. By default (scoring=None) it falls back to the estimator's own score method, which for GradientBoostingRegressor, as for all scikit-learn regressors, is the R² score. Don't confuse this with the loss parameter of GradientBoostingRegressor (default loss='squared_error'): that is the loss minimized during training, not the cross-validation scoring metric.
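If you want cross-validation to rank candidates by something other than R², pass a scorer explicitly; for example, 'neg_root_mean_squared_error' is one of scikit-learn's built-in scorer names (error metrics are negated so that higher is still better):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Rank hyperparameter candidates by RMSE instead of the default R^2
grid_search = GridSearchCV(GradientBoostingRegressor(),
                           param_grid={'learning_rate': [0.01, 0.1]},
                           cv=5,
                           scoring='neg_root_mean_squared_error')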
To answer the second part of the question, we need to understand what happens in GridSearchCV when fitting. When the optimal hyperparameters are found, a model is refitted on all the data with those hyperparameters, provided refit=True, which it is by default.
grid_search.score(X, y) scores that refitted model on all of the data, while grid_search.best_score_ returns the average score of the models with the optimal hyperparameters across the five cross-validation splits. That is why the two numbers differ.
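A quick way to see the difference, assuming grid_search has already been fitted as in your code; best_score_ is simply the best entry of the per-candidate CV averages stored in cv_results_:

# Refitted best model evaluated on whatever data you pass in (here: all of it)
print(grid_search.score(X, y))

# Mean cross-validated score of the best parameter combination
print(grid_search.best_score_)

# best_score_ is taken from cv_results_ at best_index_
print(grid_search.cv_results_['mean_test_score'][grid_search.best_index_])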
Answered By - hampus