Issue
I was trying to apply RandomForestClassifier() in a pipeline and tune its parameters. This is the dataset being used: https://www.kaggle.com/gbonesso/enem-2016
And here's the code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, GridSearchCV

imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
rf = RandomForestClassifier()
features = [
    "NU_IDADE",
    "TP_ESTADO_CIVIL",
    "NU_NOTA_CN",
    "NU_NOTA_CH",
    "NU_NOTA_LC",
    "NU_NOTA_MT",
    "NU_NOTA_COMP1",
    "NU_NOTA_COMP2",
    "NU_NOTA_COMP3",
    "NU_NOTA_COMP4",
    "NU_NOTA_COMP5",
    "NU_NOTA_REDACAO",
]
X = enem[features]
y = enem[["IN_TREINEIRO"]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)
pipeline = make_pipeline(imputer, scaler, rf)
pipe_params = {
    "randomforestregressor__n_estimators": [100, 500, 1000],
    "randomforestregressor__max_depth": [1, 5, 10, 25],
    "randomforestregressor__max_features": [*np.arange(0.1, 1.1, 0.1)],
}
gridsearch = GridSearchCV(
    pipeline, param_grid=pipe_params, cv=3, n_jobs=-1, verbose=1000
)
gridsearch.fit(X_train, y_train)
It seems to work for a few parameters, but then I get this error message:
ValueError: Invalid parameter randomforestregressor for estimator Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
('standardscaler', StandardScaler()),
('randomforestclassifier', RandomForestClassifier())]). Check the list of available parameters with `estimator.get_params().keys()`.
Also, one more issue is that I can't seem to get the cv results. I tried running the following code:
results = pd.DataFrame(gridsearch.cv_results_)
results.sort_values("rank_test_score").head()
score = pipeline.score(X_test, y_test)
score
But I got this error:
AttributeError: 'GridSearchCV' object has no attribute 'cv_results_'
Any ideas on how to fix these errors?
Solution
Nick's answer is right and will indeed solve your problem: make_pipeline names each step after its lowercased class name, so your grid keys need the prefix randomforestclassifier__, not randomforestregressor__ (the actual step name appears in the error message itself). In your case you can also skip make_pipeline in favour of the Pipeline class, which lets you name the steps yourself. I believe it's a tad more readable and concise:
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier())
])
Then access the model's parameters by prefixing them with the name you gave the classifier step:
param_grid = {
    "clf__n_estimators": [100, 500, 1000],
    "clf__max_depth": [1, 5, 10, 25],
    "clf__max_features": [*np.arange(0.1, 1.1, 0.1)],
}
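If you are ever unsure which names the grid expects, you can list them straight from the pipeline, as the error message suggests. A minimal sketch, using the pipe object defined above:

# Every valid grid key must match one of these names; classifier parameters
# appear under the "clf__" prefix because that is the step name we chose.
for name in pipe.get_params().keys():
    print(name)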
Below is a complete example based on the iris dataset:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn import datasets
import numpy as np
# Data preparation
iris = datasets.load_iris()
x = iris.data[:, :2]
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42
)
# Build a pipeline object
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier())
])
# Declare a hyperparameter grid
param_grid = {
    "clf__n_estimators": [100, 500, 1000],
    "clf__max_depth": [1, 5, 10, 25],
    "clf__max_features": [*np.arange(0.1, 1.1, 0.1)],
}
# Perform grid search, fit it, and print score
gs = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1, verbose=1000)
gs.fit(x_train, y_train)
print(gs.score(x_test, y_test))
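As for the second error: cv_results_ is only populated once fit() has completed successfully, so it will be available as soon as the parameter names are fixed. Note also that you should score the fitted GridSearchCV object (or its best_estimator_) rather than the unfitted pipeline object. A short sketch continuing the example above, assuming pandas is available:

import pandas as pd  # assumed available

# cv_results_ exists only after a successful fit
results = pd.DataFrame(gs.cv_results_)
print(results.sort_values("rank_test_score").head())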
Answered By - anddt