Issue
Here is the help of sklearn.ensemble.RandomForestClassifier.fit(). It is not clear whether there can be a problem when X and y are sorted by labels. My preliminary test suggests that it does not matter whether X and y are sorted. Is my conclusion correct?
Help on class RandomForestClassifier in module sklearn.ensemble._forest:
class RandomForestClassifier(ForestClassifier)
...
| Build a forest of trees from the training set (X, y).
|
| Parameters
| ----------
| X : {array-like, sparse matrix} of shape (n_samples, n_features)
| The training input samples. Internally, its dtype will be converted
| to ``dtype=np.float32``. If a sparse matrix is provided, it will be
| converted into a sparse ``csc_matrix``.
|
| y : array-like of shape (n_samples,) or (n_samples, n_outputs)
| The target values (class labels in classification, real numbers in
| regression).
Solution
It does not matter in the case of RandomForestClassifier.
Random forest is an ensemble of weak learners that perform majority voting. Because we need different trees that make their decisions based on different features, the algorithm uses bootstrapping (the argument bootstrap=True in RandomForestClassifier), i.e. random sampling with replacement. In addition to the bootstrap samples, a random subset of features is drawn for training each individual tree.
Bootstrapping is essential to Random Forest. Without it, all trees would be more or less similar and based on the same features, which would defeat the whole purpose of the majority voting.
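As a minimal sketch of both mechanisms (not scikit-learn's internal code): each tree sees a bootstrap sample whose indices are drawn uniformly at random with replacement, plus a random feature subset at each split, so the original row order never enters the picture. The corresponding constructor arguments are bootstrap=True and max_features (which defaults to "sqrt" for the classifier in recent scikit-learn versions).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(42)
n_samples, n_features = 1000, 20

# Bootstrap sample: indices drawn at random with replacement,
# independent of how the rows of X and y happen to be ordered.
bootstrap_indices = rng.choice(n_samples, size=n_samples, replace=True)
print("first bootstrap indices:", bootstrap_indices[:10])

# Random feature subset considered for a split ("sqrt" rule).
feature_subset = rng.choice(n_features, size=int(np.sqrt(n_features)), replace=False)
print("feature subset:", feature_subset)

# The same ideas expressed through the estimator's arguments.
clf = RandomForestClassifier(bootstrap=True, max_features="sqrt", random_state=42)
```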
Therefore we can say that the order of the samples does not matter. However, as desertnaut said in their comment, it is always better to shuffle the data to avoid other potential problems.
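If you want to follow that advice, a short snippet like the following (a sketch using sklearn.utils.shuffle) is enough to randomize the row order before fitting:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Shuffle rows and labels together so they stay aligned, then fit as usual.
X_shuffled, y_shuffled = shuffle(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_shuffled, y_shuffled)
```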
Note: the StatQuest videos on the subject are really nice for understanding how it works in depth.
Answered By - Antoine Dubuis