Issue
While constructing each tree in the random forest using bootstrapped samples, for each terminal node, we select m variables at random from p variables to find the best split (p is the total number of features in your data). My questions (for RandomForestRegressor) are:
1) What does max_features correspond to (m or p or something else)?
2) Are m variables selected at random from max_features variables (what is the value of m)?
3) If max_features corresponds to m, then why would I want to set it equal to p for regression (the default)? Where is the randomness with this setting (i.e., how is it different from bagging)?
Thanks.
Solution
Straight from the documentation:
"[max_features] is the size of the random subsets of features to consider when splitting a node."
So max_features is what you call m. When max_features="auto", m = p and no feature subset selection is performed in the trees, so the "random forest" is actually a bagged ensemble of ordinary regression trees. The docs go on to say that
"Empirical good default values are max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks."
By setting max_features differently, you'll get a "true" random forest.
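To make the "where is the randomness" point concrete, here is a minimal standard-library sketch of the per-split feature draw. It is not scikit-learn's implementation: candidate_features is a hypothetical helper, and p = 9 and m = 3 are arbitrary values chosen for illustration.

```python
import random


def candidate_features(p, m, rng):
    """Hypothetical helper: feature indices examined at one split,
    drawn without replacement as m out of p."""
    return sorted(rng.sample(range(p), m))


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
p = 9                   # total number of features (illustrative value)

# m = p: every split examines all p features, so the draw adds no randomness.
# Only the bootstrap sample differs between trees, which is plain bagging.
full = [candidate_features(p, p, rng) for _ in range(5)]

# m < p: each split examines only a random subset of m features.
m = 3
subsets = [candidate_features(p, m, rng) for _ in range(5)]

print(full)
print(subsets)
```

In scikit-learn terms, passing an integer such as max_features=3 to RandomForestRegressor asks for m = 3 candidate features per split. Note that recent scikit-learn releases no longer accept "auto"; as far as I know the regressor default is now max_features=1.0, which still means m = p.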
Answered By - Fred Foo