Issue
I was trying to apply RandomForestClassifier() in a pipeline and tune its parameters. This is the dataset being used: https://www.kaggle.com/gbonesso/enem-2016
And here's the code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, GridSearchCV

imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
rf = RandomForestClassifier()
features = [
    "NU_IDADE",
    "TP_ESTADO_CIVIL",
    "NU_NOTA_CN",
    "NU_NOTA_CH",
    "NU_NOTA_LC",
    "NU_NOTA_MT",
    "NU_NOTA_COMP1",
    "NU_NOTA_COMP2",
    "NU_NOTA_COMP3",
    "NU_NOTA_COMP4",
    "NU_NOTA_COMP5",
    "NU_NOTA_REDACAO",
]
X = enem[features]
y = enem[["IN_TREINEIRO"]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)
pipeline = make_pipeline(imputer, scaler, rf)
pipe_params = {
    "randomforestregressor__n_estimators": [100, 500, 1000],
    "randomforestregressor__max_depth": [1, 5, 10, 25],
    "randomforestregressor__max_features": [*np.arange(0.1, 1.1, 0.1)],
}
gridsearch = GridSearchCV(
    pipeline, param_grid=pipe_params, cv=3, n_jobs=-1, verbose=1000
)
gridsearch.fit(X_train, y_train)
It seems to work for a few parameters, but then I get this error message:
ValueError: Invalid parameter randomforestregressor for estimator Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
('standardscaler', StandardScaler()),
('randomforestclassifier', RandomForestClassifier())]). Check the list of available parameters with `estimator.get_params().keys()`.
Also, one more issue is that I can't seem to get the cv results. I tried running the following code:
results = pd.DataFrame(gridsearch.cv_results_)
results.sort_values("rank_test_score").head()
score = pipeline.score(X_test, y_test)
score
But I got this error:
AttributeError: 'GridSearchCV' object has no attribute 'cv_results_'
Any ideas on how to fix these errors?
Solution
Nick's answer is right and will indeed solve your problem: make_pipeline names each step after its lowercased class name, so your grid keys need the prefix randomforestclassifier__, not randomforestregressor__ (the actual step name appears in the error message itself). In your case you can also skip make_pipeline in favour of the Pipeline class, which lets you name the steps yourself. I believe it's a tad more readable and concise:
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier())
])
Then access the model's parameters by prefixing them with the name you gave the classifier step:
param_grid = {
    "clf__n_estimators": [100, 500, 1000],
    "clf__max_depth": [1, 5, 10, 25],
    "clf__max_features": [*np.arange(0.1, 1.1, 0.1)],
}
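If you are ever unsure which names the grid expects, you can list them straight from the pipeline, as the error message suggests. A minimal sketch, using the pipe object defined above:

# Every valid grid key must match one of these names; classifier parameters
# appear under the "clf__" prefix because that is the step name we chose.
for name in pipe.get_params().keys():
    print(name)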
Below is a complete example based on the iris dataset:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn import datasets
import numpy as np
# Data preparation
iris = datasets.load_iris()
x = iris.data[:, :2]
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42
)
# Build a pipeline object
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier())
])
# Declare a hyperparameter grid
param_grid = {
    "clf__n_estimators": [100, 500, 1000],
    "clf__max_depth": [1, 5, 10, 25],
    "clf__max_features": [*np.arange(0.1, 1.1, 0.1)],
}
# Perform grid search, fit it, and print score
gs = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1, verbose=1000)
gs.fit(x_train, y_train)
print(gs.score(x_test, y_test))
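As for the second error: cv_results_ is only populated once fit() has completed successfully, so it will be available as soon as the parameter names are fixed. Note also that you should score the fitted GridSearchCV object (or its best_estimator_) rather than the unfitted pipeline object. A short sketch continuing the example above, assuming pandas is available:

import pandas as pd  # assumed available

# cv_results_ exists only after a successful fit
results = pd.DataFrame(gs.cv_results_)
print(results.sort_values("rank_test_score").head())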
Answered By - anddt