Issue
I am doing hyperparameter tuning with GridSearchCV
for Decision Trees. I have fit the model and I am trying to find out what exactly GridSearchCV.cv_results_
gives. I have read the documentation but it's still not clear. Could anyone explain this attribute?
My code is below:
depth={"max_depth":[1,5,10,50,100,500,1000],
"min_samples_split":[5,10,100,500]}
DTC=DecisionTreeClassifier(class_weight="balanced")
DTC_Grid=GridSearchCV(DTC,param_grid=depth , cv=3, scoring='roc_auc')
DTC_Bow=DTC_Grid.fit(xtrain_bow,ytrain_bow)
Solution
DTC_Bow.cv_results_ is a dictionary holding all the evaluation metrics from the gridsearch. To visualize it properly, you can do
pd.DataFrame(DTC_Bow.cv_results_)
In your case, this should return a dataframe with 28 rows (7 choices for max_depth times 4 choices for min_samples_split). Each row of this dataframe gives the gridsearch metrics for one combination of these two parameters. Remember, the goal of a gridsearch is to select which combination of parameters has the best performance metrics. This is the purpose of cv_results_.
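For example, a quick way to sanity-check the shape and see which columns are available (a sketch, assuming the fitted DTC_Bow object from your code):

import pandas as pd

# Build a DataFrame from the cv_results_ dictionary of the fitted GridSearchCV
results = pd.DataFrame(DTC_Bow.cv_results_)

print(results.shape)              # (28, n_columns): 7 max_depth values x 4 min_samples_split values
print(results.columns.tolist())   # all the metric columns described below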
You should have one column called param_max_depth and another called param_min_samples_split referencing the value of the parameter for each row. The combination of the two is summarized as a dictionary in the column params.
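For instance, to look at just those parameter columns (reusing the results DataFrame built above):

# One row per parameter combination; params holds the full dictionary for that row
print(results[["param_max_depth", "param_min_samples_split", "params"]].head())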
Now to the metrics. The default value for return_train_score was True up until now, but it will change to False in version 0.21. If you want the train metrics, set it to True. But usually, what you are interested in are the test metrics.
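If you do want the train metrics, pass the flag explicitly when building the gridsearch; a sketch reusing the DTC estimator and depth grid from your code:

# Request train scores so mean_train_score and split*_train_score columns are included
DTC_Grid = GridSearchCV(DTC, param_grid=depth, cv=3, scoring='roc_auc',
                        return_train_score=True)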
The main column is mean_test_score. This is the average of the columns split0_test_score, split1_test_score and split2_test_score (because you are doing a 3-fold split in your gridsearch). If you do DTC_Bow.best_score_, this will return the max value of the column mean_test_score. The column rank_test_score ranks all parameter combinations by the values of mean_test_score.
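As an example, you can recover best_score_ from the DataFrame yourself and order the rows by rank (again using the results DataFrame from above):

# best_score_ is simply the maximum of mean_test_score over all 28 rows
print(DTC_Bow.best_score_)
print(results["mean_test_score"].max())

# best_params_ is the combination that achieved it
print(DTC_Bow.best_params_)

# rank_test_score == 1 marks the winning row; sorting by it orders combinations best-first
print(results.sort_values("rank_test_score")[["params", "mean_test_score", "rank_test_score"]].head())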
You might also want to look at std_test_score, which is the standard deviation of split0_test_score, split1_test_score and split2_test_score. This might be of interest if you want to see how consistently your set of parameters performs on your hold-out data.
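For instance, to see the spread next to the mean for each combination (same results DataFrame):

# Mean and standard deviation of the ROC AUC across the 3 validation folds
print(results[["params", "mean_test_score", "std_test_score"]]
      .sort_values("mean_test_score", ascending=False)
      .head())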
As mentioned, you can have the metrics on the train set as well, provided you set return_train_score=True.
Finally, there are also time columns that tell you how long each row took. They measure how much time it took to train the model (mean_fit_time, std_fit_time) and to evaluate it (mean_score_time, std_score_time). This is just an FYI: usually, unless time is a bottleneck, you would not really look at these metrics.
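If you ever do need them, they are right there in the same DataFrame (a sketch):

# Average and standard deviation of fit / scoring time (in seconds) per parameter combination
print(results[["params", "mean_fit_time", "std_fit_time",
               "mean_score_time", "std_score_time"]].head())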
Answered By - MaximeKan