Issue
I am working on a dataset composed of 20060 rows and 10 columns, and I am using a decision tree regressor to make predictions.
I want to use RandomizedSearchCV to tune the hyperparameters; my doubt is what to write in the dictionary as values for 'min_samples_leaf' and 'min_samples_split'.
My professor told me to rely on the dimensions of the dataset, but I don't understand how!
This is a code example:
from time import time

import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# X_train, X_test, y_train, y_test come from an earlier train/test split (not shown)

def model(model, params):
    r_2 = []
    mae_ = []
    rs = RandomizedSearchCV(model, params, cv=5, n_jobs=-1, n_iter=30)
    start = time()
    rs.fit(X_train, y_train)
    # prediction on test data
    y_pred = rs.predict(X_test)
    # R2
    r2 = round(r2_score(y_test, y_pred), 2)
    print('R2 on test set: %.2f' % r2)
    r_2.append(r2)
    # MAE
    mae = round(mean_absolute_error(y_test, y_pred), 2)
    print('Mean absolute Error: %.2f' % mae)
    mae_.append(mae)
    # print running time
    print('RandomizedSearchCV took: %.2f' % (time() - start), 'seconds')
    return r_2, mae_

params = {
    'min_samples_split': np.arange(),  # define these two hyperparameters relying on the dataset size???
    'min_samples_leaf': np.arange()
}

DT = model(DecisionTreeRegressor(), params)
Can anybody explain this to me?
Thank you very much
Solution
Your professor means that you should look at the size of your dataset when choosing values for these parameters.
For DecisionTreeRegressor, both min_samples_split and min_samples_leaf can be expressed relative to n_samples, which is the number of rows. The documentation describes both parameters in the same way:
min_samples_split: int or float, default=2
The minimum number of samples required to split an internal node:
· If int, then consider min_samples_split as the minimum number.
· If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
As the documentation says, if you want to define these parameters relative to n_samples (as your professor suggests), you have to use floats, which represent a fraction (between 0.0 and 1.0) of your number of samples.
For example, if you want min_samples_split to be about 100, you can write it in two ways: as the integer 100, or as the float 0.005 (0.005 * 20060 = 100.3, which ceil rounds up to 101). Keep in mind that n_samples is the number of rows actually passed to fit, so on a training split or inside cross-validation folds it is smaller than 20060.
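The conversion from a float setting to a sample count can be checked by hand. The snippet below is only a sketch of the rule quoted from the documentation; effective_min_samples is a hypothetical helper written for illustration, not a scikit-learn function, and n_samples = 20060 assumes the tree is fitted on the full dataset.

```python
from math import ceil


def effective_min_samples(value, n_samples):
    """Hypothetical helper: sample count implied by an int or float setting."""
    if isinstance(value, int):
        # An int is used as the minimum number directly.
        return value
    # A float is a fraction of n_samples, rounded up.
    return ceil(value * n_samples)


n_samples = 20060  # assumption: the tree is fitted on all rows
print(effective_min_samples(100, n_samples))    # 100
print(effective_min_samples(0.005, n_samples))  # 101, because 0.005 * 20060 = 100.3
```

So the float 0.005 gives roughly, but not exactly, the same threshold as the integer 100.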
Using floats lets you specify values that do not depend on the absolute number of samples, which is an advantage if the dataset size changes.
That said, you probably will not see big improvements from tuning them, since the defaults (min_samples_split=2 and min_samples_leaf=1) are very small.
All of this applies to both min_samples_split and min_samples_leaf.
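As a rough sketch of how fraction-based values could be passed to RandomizedSearchCV: the synthetic data from make_regression is a stand-in for the real dataset, and the ranges in params are illustrative assumptions, not recommended values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real data (assumption: 2000 rows, 10 columns).
X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Floats are fractions of n_samples; these ranges are only illustrative.
params = {
    "min_samples_split": np.linspace(0.001, 0.05, 20),
    "min_samples_leaf": np.linspace(0.0005, 0.025, 20),
}

rs = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    params,
    n_iter=10,
    cv=5,
    random_state=0,
)
rs.fit(X_train, y_train)

print(rs.best_params_)
print("R2 on test set: %.2f" % rs.score(X_test, y_test))
```

Because the values are fractions, the same params dictionary can be reused if the number of rows changes; integer ranges built with np.arange would work as well but would need rescaling by hand.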
Answered By - Ed Piedad