Issue
I'm trying to build an ensemble of some models using VotingClassifier() from sklearn to see if it works better than the individual models. I'm trying it in two different ways:
- I'm trying to do it with individual Random Forest, Gradient Boosting, and XGBoost models.
- I'm trying to build it using an ensemble of many Random Forest models (using different parameters for n_estimators and max_depth).
In the first case, I'm doing this:
import xgboost as xgb
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              VotingClassifier)

estimator = []
estimator.append(('RF', RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=8, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=900,
n_jobs=-1, oob_score=True, random_state=66, verbose=0,
warm_start=True)))
estimator.append(('GB', GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
learning_rate=0.03, loss='deviance', max_depth=5,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=1000,
n_iter_no_change=None, presort='deprecated',
random_state=66, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)))
estimator.append(('XGB', xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0,
learning_rate=0.1, max_delta_step=0, max_depth=9,
min_child_weight=1, n_estimators=1000, n_jobs=1,
nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=1, verbosity=1)))
And when I do
ensemble_model_churn = VotingClassifier(estimators=estimator, voting='soft')
and display ensemble_model_churn, I get all of the estimators in the output.
But in the second case, I'm doing this:
estimator = []
estimator.append(('RF_1',RandomForestClassifier(n_estimators=500,max_depth=5,warm_start=True)))
estimator.append(('RF_2',RandomForestClassifier(n_estimators=500,max_depth=6,warm_start=True)))
estimator.append(('RF_3',RandomForestClassifier(n_estimators=500,max_depth=7,warm_start=True)))
estimator.append(('RF_4',RandomForestClassifier(n_estimators=500,max_depth=8,warm_start=True)))
And so on; I have 30 different models like that (a loop such as the sketch below could generate the full list).
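For illustration, a loop over a parameter grid could build the same kind of list; the grid values in this sketch are assumptions, not the exact ones used:

import itertools
from sklearn.ensemble import RandomForestClassifier

# Illustrative grid: 3 values of n_estimators x 10 values of max_depth = 30 models.
param_grid = itertools.product([300, 500, 700], range(5, 15))

estimator = [
    (f'RF_{i}', RandomForestClassifier(n_estimators=n, max_depth=d, warm_start=True))
    for i, (n, d) in enumerate(param_grid, start=1)
]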
But this time, when I do
ensemble_model_churn = VotingClassifier(estimators=estimator, voting='soft')
and display it, I seem to get only the first one, and not the others.
print(ensemble_model_churn)
>>>VotingClassifier(estimators=[('RF_1',
RandomForestClassifier(bootstrap=True,
ccp_alpha=0.0,
class_weight=None,
criterion='gini',
max_depth=5,
max_features='auto',
max_leaf_nodes=None,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=500,
n_jobs=None,
oob_score=...
criterion='gini',
max_depth=5,
max_features='auto',
max_leaf_nodes=None,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=500,
n_jobs=None,
oob_score=False,
random_state=None,
verbose=0,
warm_start=True))],
flatten_transform=True, n_jobs=None, voting='soft',
weights=None)
Why is this happening? Is it not possible to run an ensemble of the same model?
Solution
You are seeing more than one of the estimators; it's just a little hard to tell. Notice the ellipsis (...) after the first oob_score parameter, and that after it some of the hyperparameters are repeated. scikit-learn's repr just doesn't want to print such a giant wall of text, so it trims out most of the middle. You can verify that everything is there by checking that len(ensemble_model_churn.estimators) > 1.
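A minimal sketch of that check, mirroring the second setup (the depth grid here is illustrative):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier

# Rebuild a small version of the second setup.
estimator = [
    (f'RF_{d}', RandomForestClassifier(n_estimators=500, max_depth=d, warm_start=True))
    for d in range(5, 9)
]
ensemble_model_churn = VotingClassifier(estimators=estimator, voting='soft')

# All four estimators are really there; only the printed repr was trimmed.
print(len(ensemble_model_churn.estimators))                   # 4
print([name for name, _ in ensemble_model_churn.estimators])  # ['RF_5', 'RF_6', 'RF_7', 'RF_8']

In recent scikit-learn versions you can also call sklearn.set_config(print_changed_only=True), so the repr shows only non-default parameters and avoids the wall of text in the first place.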
Another note: scikit-learn is very much against doing any validation at model initialization, preferring to do such checking at fit time. (This is because of the way it clones estimators in grid searches and the like.) So it's very unlikely that anything will be changed from your explicit input until you call fit.
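A quick way to see that deferral in action (toy data assumed; the exact error message varies across scikit-learn versions):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# An invalid hyperparameter is accepted silently at construction time...
clf = RandomForestClassifier(n_estimators=-5)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# ...and is only rejected when validation runs inside fit.
try:
    clf.fit(X, y)
except ValueError as err:
    print(err)  # e.g. a complaint that n_estimators must be positive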
Answered By - Ben Reiniger