Issue
I'm developing a model to predict the target variable using the RandomForestRegressor from scikit-learn.
I have developed a function to get the MSE as below:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def get_mse(n_estimators, max_leaf_nodes, X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=n_estimators, max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    preds_val = model.predict(X_valid)
    mse = mean_squared_error(y_valid, preds_val, squared=False)
    return mse
I would like to use a for loop to get the best MSE scores by combining lists of values for n_estimators and max_leaf_nodes.
Below is the code that I wrote:
n_estimators = [100, 150, 200, 250]
max_leaf_nodes = [10, 50, 100, 200]
for n_estimators, max_leaf_nodes in zip(n_estimators, max_leaf_nodes):
    my_mse = get_mse(n_estimators, max_leaf_nodes, X_train, X_valid, y_train, y_valid)
    print("N_estimators: %d \t\t Max leaf nodes: %d \t\t Mean Squared Error: %d" % (n_estimators, max_leaf_nodes, my_mse))
But when I run this for loop, it always returns an MSE of 0 for each combination of the two hyperparameters.
I have tried my function with the following code and it returns the correct MSE:
get_mse(200, 100, X_train, X_valid, y_train, y_valid)
I'm wondering why my for loop is not working properly and always returns an MSE of 0.
Could someone help me solve this issue?
Thank you
Solution
There are two main things to consider:
First, do not shadow the names you already used to declare the lists of values (n_estimators and max_leaf_nodes). Instead, make them clearly distinguishable:
n_estimators_list = [100, 150, 200, 250]
max_leaf_nodes_list = [10, 50, 100, 200]
for n_estimators, max_leaf_nodes in zip(n_estimators_list, max_leaf_nodes_list):
    ...
Secondly, as pointed out in the comments above, you should replace the %d formatter for mse with %f, since values between 0 and 1 would otherwise be formatted as 0:
print("N_estimators: %d \t\t Max leaf nodes: %d \t\t Mean Squared Error: %f" %(n_estimators, max_leaf_nodes, my_mse))
Personally, I would recommend using one of the newer string formatting options, for example Python 3's f-strings, to avoid such mishaps:
print(f"N_estimators: {n_estimators} \t\t Max leaf nodes: {max_leaf_nodes} \t\t Mean Squared Error: {my_mse}")
A last note that has also already been mentioned in the comments: for hyperparameter tuning, you could use GridSearchCV, which is a pre-implemented functionality to find the best hyperparameters using an exhaustive search over a pre-defined grid. Example usage:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 150, 200, 250],
    'max_leaf_nodes': [10, 50, 100, 200]
}

gs = GridSearchCV(
    estimator=RandomForestRegressor(),
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error'
)

gs.fit(X, y)
print(gs.best_params_)
The advantage is that this implementation is battle-proven, provides many readily available values and statistics to inspect the result, and uses cross-validation. Furthermore, it will explore all possible hyperparameter combinations (in contrast to your own loop which does not).
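As a minimal sketch of how the fitted search could be inspected (best_score_ and cv_results_ are attributes of GridSearchCV; the pandas conversion is just one convenient way to view them):

import pandas as pd

# best cross-validated score; negative because the 'neg_root_mean_squared_error' scorer is used
print(gs.best_score_)

# per-combination results as a table
results = pd.DataFrame(gs.cv_results_)
print(results[['param_n_estimators', 'param_max_leaf_nodes', 'mean_test_score', 'rank_test_score']])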
You can read more about GridSearchCV in its documentation.
Answered By - afsharov