Issue
I'm trying to use GridSearchCV for RandomForestRegressor, but I always get ValueError: Found array with dim 100. Expected 500. Consider this toy example:
import numpy as np
from sklearn import ensemble
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import r2_score

if __name__ == '__main__':
    X = np.random.rand(1000, 2)
    y = np.random.rand(1000)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=1)

    # Set the parameters by cross-validation
    tuned_parameters = {'n_estimators': [500, 700, 1000],
                        'max_depth': [None, 1, 2, 3],
                        'min_samples_split': [1, 2, 3]}

    # clf = ensemble.RandomForestRegressor(n_estimators=500, n_jobs=1, verbose=1)
    clf = GridSearchCV(ensemble.RandomForestRegressor(), tuned_parameters, cv=5,
                       scoring=r2_score, n_jobs=-1, verbose=1)
    clf.fit(X_train, y_train)

    print clf.best_estimator_
This is what I get:
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Traceback (most recent call last):
File "C:\Users\abudis\Dropbox\machine_learning\toy_example.py", line 21, in <module>
clf.fit(X_train, y_train)
File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\grid_search.py", line 596, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\grid_search.py", line 378, in _fit
for parameters in parameter_iterable
File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.py", line 653, in __call__
self.dispatch(function, args, kwargs)
File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.py", line 400, in dispatch
job = ImmediateApply(func, args, kwargs)
File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.py", line 138, in __init__
self.results = func(*args, **kwargs)
File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\cross_validation.py", line 1240, in _fit_and_score
test_score = _score(estimator, X_test, y_test, scorer)
File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\cross_validation.py", line 1296, in _score
score = scorer(estimator, X_test, y_test)
File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\metrics\metrics.py", line 2324, in r2_score
y_type, y_true, y_pred = _check_reg_targets(y_true, y_pred)
File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\metrics\metrics.py", line 65, in _check_reg_targets
y_true, y_pred = check_arrays(y_true, y_pred)
File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\utils\validation.py", line 254, in check_arrays
% (size, n_samples))
ValueError: Found array with dim 100. Expected 500
For some reason GridSearchCV seems to think the n_estimators parameter should equal the size of each fold. If I change the first value in the n_estimators list inside tuned_parameters, I get a ValueError with a different expected value. Training just one model with clf = ensemble.RandomForestRegressor(n_estimators=500, n_jobs=1, verbose=1) works fine, so I'm not sure whether I'm doing something wrong or there's a bug in scikit-learn somewhere.
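For completeness, this is roughly what the working single-model version looks like: it is just the commented-out estimator from above, with an explicit R^2 check on the held-out half added here for illustration.

clf = ensemble.RandomForestRegressor(n_estimators=500, n_jobs=1, verbose=1)
clf.fit(X_train, y_train)
# r2_score is called directly with (y_true, y_pred), which is its actual signature
print r2_score(y_test, clf.predict(X_test))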
Solution
Looks like a bug, but in your case it should work if you use RandomForestRegressor's own scorer (which coincidentally is the R^2 score) by not specifying any scoring function in GridSearchCV:
clf = GridSearchCV(ensemble.RandomForestRegressor(), tuned_parameters, cv=5,
                   n_jobs=-1, verbose=1)
EDIT: As mentioned by @jnothman in #4081 this is the real problem:
scoring does not accept a metric function. It accepts a function of signature (estimator, X, y_true=None) -> float score. You can use scoring='r2' or scoring=make_scorer(r2_score).
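In other words, either drop the scoring argument entirely (as above) or wrap the metric so it has the scorer signature. A minimal sketch of the second option, reusing the tuned_parameters dict from the question, might look like this:

from sklearn.grid_search import GridSearchCV
from sklearn.metrics import r2_score, make_scorer

# make_scorer turns the (y_true, y_pred) metric into an (estimator, X, y) scorer;
# passing scoring='r2' would have the same effect here.
clf = GridSearchCV(ensemble.RandomForestRegressor(), tuned_parameters, cv=5,
                   scoring=make_scorer(r2_score), n_jobs=-1, verbose=1)
clf.fit(X_train, y_train)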
Answered By - elyase