Issue
Suppose I construct an ensemble of two estimators, where each estimator runs its own parameter search:
Imports and regression dataset:
from sklearn.ensemble import VotingRegressor, StackingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
X, y = make_regression()
Define two self-tuning estimators, and ensemble them:
rf_param_dist = dict(n_estimators=[1, 2, 3, 4, 5])
rf_searcher = RandomizedSearchCV(RandomForestRegressor(), rf_param_dist, n_iter=5, cv=3)
dt_param_dist = dict(max_depth=[4, 5, 6, 7, 8])
dt_searcher = RandomizedSearchCV(DecisionTreeRegressor(), dt_param_dist, n_iter=5, cv=3)
ensemble = StackingRegressor(
    [('rf', rf_searcher), ('dt', dt_searcher)]
).fit(X, y)
My questions are about how sklearn handles the fitting of ensemble.
Q1) We have two unfitted estimators in parallel, and both need to be fitted before ensemble.predict(...) would work. But we can't fit any of the estimators without first getting a prediction from the ensemble. How does sklearn handle this circular dependency?
Q2) Since we have two estimators running independent tuning, does each estimator make the false assumption that the parameters of the other estimator are fixed? So we end up with a poorly-defined optimisation problem.
For reference, I think the correct way to jointly optimise the models of an ensemble would be to define a single CV that searches over all parameters jointly, shown below. But my questions are about how sklearn handles the special case described earlier.
# Joint optimisation
ensemble = VotingRegressor(
    [('rf', RandomForestRegressor()), ('dt', DecisionTreeRegressor())]
)
jointsearch_param_dist = dict(
    rf__n_estimators=[1, 2, 3, 4, 5],
    dt__max_depth=[4, 5, 6, 7, 8]
)
ensemble_jointsearch = RandomizedSearchCV(ensemble, jointsearch_param_dist)
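For completeness, this joint search would then be fitted in one go; a sketch (RandomizedSearchCV samples n_iter of the 25 combinations, 10 by default):
# Fit the joint search; one combination of rf/dt hyperparameters wins overall.
ensemble_jointsearch.fit(X, y)
print(ensemble_jointsearch.best_params_)  # e.g. {'rf__n_estimators': 3, 'dt__max_depth': 5}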
Solution
Q1) We have two unfitted estimators in parallel, and both need to be fitted before ensemble.predict(...) would work. But we can't fit any of the estimators without first getting a prediction from the ensemble. How does sklearn handle this circular dependency?
The second sentence ("we can't fit any of the estimators without first getting a prediction from the ensemble") is incorrect. In this approach, the individual estimators are tuned for their own performance; the tuning doesn't know or care that eventually it will get used in an ensemble. After the optimal hyperparameters are found, the resulting models get ensembled.
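A minimal sketch of what that independent tuning amounts to (the ensemble actually fits clones of the searchers, so the originals stay untouched; shown directly here for illustration):
# Each search runs on its own, with no knowledge of the other model or of the ensemble.
rf_searcher.fit(X, y)
dt_searcher.fit(X, y)
print(rf_searcher.best_params_)  # e.g. {'n_estimators': 4}
print(dt_searcher.best_params_)  # e.g. {'max_depth': 6}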
For VotingRegressor that's basically it. For StackingRegressor there's a nested cross-validation happening, so the description above is slightly wrong: the hyperparameter tuning happens on every fold of the stacking fitting procedure, with the selected model making predictions on the held-out fold to build the training set for the meta-estimator. So in fact, different hyperparameters may be selected for each such split. The final ensemble uses yet another version of the model, tuned on the entire training set.
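For the stacking case, here is a rough sketch of that nested procedure, assuming the defaults StackingRegressor uses (5-fold splitting and a RidgeCV final estimator); it is a simplification of what StackingRegressor.fit actually does:
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions: the search is re-run on each training fold,
# so different hyperparameters may win on different folds.
rf_oof = cross_val_predict(rf_searcher, X, y, cv=5)
dt_oof = cross_val_predict(dt_searcher, X, y, cv=5)

# The meta-estimator is trained on those held-out predictions.
meta_features = np.column_stack([rf_oof, dt_oof])
final_estimator = RidgeCV().fit(meta_features, y)

# The base models used at predict time are re-tuned once more on the full training set.
rf_final = clone(rf_searcher).fit(X, y)
dt_final = clone(dt_searcher).fit(X, y)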
Q2) Since we have two estimators running independent tuning, does each estimator make the false assumption that the parameters of the other estimator are fixed? So we end up with a poorly-defined optimisation problem.
Given the above, this is moot: the two estimators don't know that the other even exists, let alone what hyperparameters will be selected for it. But you are right that there's a difference with your final approach.
For reference, I think the correct way to jointly optimise the models of an ensemble would be to define a single CV that searches over all parameters jointly, shown below. But my questions are about how sklearn handles the special case described earlier.
Both methods are "correct"; it just depends on how much of the hyperparameter space you want to search. The former assumes that the optimal ensemble arises from optimal individual models, whereas the latter allows for an ensemble that performs better using suboptimal base models. The latter is more likely to be true, but is more computationally costly, and the former may or may not perform reasonably well in comparison. (And maybe the latter can even overfit more on your training set, given its greater capacity.)
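If you want to check empirically which trade-off wins on your data, you can score both strategies with an outer cross-validation. A sketch (ensemble_of_searchers is a hypothetical name; a VotingRegressor is used for the first approach so that both sides ensemble the same way):
from sklearn.model_selection import cross_val_score

# First approach: each base model tunes itself inside the ensemble.
ensemble_of_searchers = VotingRegressor([('rf', rf_searcher), ('dt', dt_searcher)])

# The outer CV re-runs the inner tuning on every fold, so the scores reflect the whole procedure.
print(cross_val_score(ensemble_of_searchers, X, y, cv=5).mean())
print(cross_val_score(ensemble_jointsearch, X, y, cv=5).mean())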
Answered By - Ben Reiniger