Wednesday, March 16, 2022

[FIXED] How to define 'min sample split' and 'min sample leaf' in decision tree regressor?

Tags: decision-tree, python, python-3.x, scikit-learn

Issue

I am working on a dataset with 20060 rows and 10 columns, and I am using a decision tree regressor to make predictions.

I want to use RandomizedSearchCV to tune the hyperparameters; my doubt is what values to put in the dictionary for 'min_samples_leaf' and 'min_samples_split'.

My professor told me to rely on the size of the dataset, but I don't understand how!

This is a code example:

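from time import time

import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

def model(model, params):
    # X_train, X_test, y_train and y_test are assumed to be defined beforehand
    r_2 = []
    mae_ = []
    rs = RandomizedSearchCV(model, params, cv=5, n_jobs=-1, n_iter=30)
    start = time()
    rs.fit(X_train, y_train)
    # prediction on test data
    y_pred = rs.predict(X_test)
    # R2 (built-in round works for both NumPy and plain Python floats)
    r2 = round(r2_score(y_test, y_pred), 2)
    print('R2 on test set: %.2f' % r2)
    r_2.append(r2)
    # MAE
    mae = round(mean_absolute_error(y_test, y_pred), 2)
    print('Mean absolute Error: %.2f' % mae)
    mae_.append(mae)
    # print running time
    print('RandomizedSearchCV took: %.2f' % (time() - start), 'seconds')
    return r_2, mae_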

params = {
    # placeholders: how should these two hyperparameters be defined from the dataset size?
    'min_samples_split': np.arange(),
    'min_samples_leaf': np.arange()
}

DT = model(DecisionTreeRegressor(), params)

Can anybody explain this to me?

Thank you very much

Solution

Your professor means that you should look at the size of your data when choosing values for these parameters.

For DecisionTreeRegressor, min_samples_split and min_samples_leaf can both be expressed relative to n_samples, which is the number of rows in the data passed to fit. The documentation describes both parameters in the same way:

min_samples_split: int or float, default=2

The minimum number of samples required to split an internal node:

· If int, then consider min_samples_split as the minimum number.

· If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

As the documentation says, if you want to tie these parameters to n_samples (as your professor suggests), use floats, which represent a fraction (between 0.0 and 1.0) of the number of samples.

For example, if you want min_samples_split to be about 100, you can write it in two ways: as the integer 100, or as the float 0.005. Note that 0.005 * 20060 = 100.3, which ceil rounds up to 101, so the two are close but not identical.
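The rounding rule quoted from the documentation can be checked with a few lines of Python. This is only a minimal sketch: effective_min_samples is a hypothetical helper written for illustration, not part of scikit-learn, and the 16048-row figure assumes an 80% training split of the 20060 rows.

```python
import math

def effective_min_samples(value, n_samples):
    """Return the absolute sample count implied by a min_samples_* setting.

    Hypothetical helper: integers are used as-is, floats are treated as a
    fraction of n_samples and rounded up, as in the quoted documentation.
    """
    if isinstance(value, int):
        return value
    return math.ceil(value * n_samples)

n_rows = 20060  # dataset size from the question

print(effective_min_samples(100, n_rows))    # integer is taken literally -> 100
print(effective_min_samples(0.005, n_rows))  # ceil(100.3) -> 101
print(effective_min_samples(0.005, 16048))   # assumed 80% training split, ceil(80.24) -> 81
```

Because the fraction is applied to whatever data is passed to fit, the effective threshold is smaller on a training split, and smaller again inside each cross-validation fold.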

Using floats lets you specify values that scale with the number of samples instead of fixed counts, so the same setting carries over to datasets of different sizes. This is an advantage.

Anyway, you probably will not see big improvements, since the defaults are already very small (min_samples_split=2, min_samples_leaf=1).

This applies to both min_samples_split and min_samples_leaf.
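As a rough sketch of how fractional values could be passed to RandomizedSearchCV, the example below searches over float candidates for both parameters. The synthetic data from make_regression stands in for the asker's dataset, and the candidate ranges, n_iter, and cv values are assumptions chosen for illustration, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real data (assumption: 10 feature columns).
X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Floats are fractions of n_samples; these ranges are illustrative only.
params = {
    "min_samples_split": np.linspace(0.001, 0.05, 20).tolist(),
    "min_samples_leaf": np.linspace(0.0005, 0.025, 20).tolist(),
}

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    params,
    n_iter=10,
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

print(search.best_params_)            # sampled fractions with the best CV score
print(search.score(X_test, y_test))   # R2 of the refitted best tree on the test set
```

Expressing the candidates as fractions means the same dictionary can be reused if the number of rows changes, which is the point made above about floats.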

Answered By - Ed Piedad

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
