Issue
I was going through the scikit-learn class DecisionTreeClassifier.
Looking at the parameters for the class, we have two parameters, min_samples_split and min_samples_leaf. The basic idea behind them looks similar: you specify a minimum number of samples required to decide whether a node should be a leaf or be split further.
Why do we need two parameters when one implies the other? Is there any reason or scenario that distinguishes them?
Solution
From the documentation:
The main difference between the two is that min_samples_leaf guarantees a minimum number of samples in a leaf, while min_samples_split can create arbitrary small leaves, though min_samples_split is more common in the literature.
To get a grasp of this piece of documentation I think you should make the distinction between a leaf (also called external node) and an internal node. An internal node will have further splits (also called children), while a leaf is by definition a node without any children (without any further splits).
min_samples_split specifies the minimum number of samples required to split an internal node, while min_samples_leaf specifies the minimum number of samples required to be at a leaf node.
For instance, if min_samples_split = 5 and there are 7 samples at an internal node, then the split is allowed. But let's say the split results in two leaves, one with 1 sample and another with 6 samples. If min_samples_leaf = 2, then the split won't be allowed (even though the internal node has 7 samples) because one of the resulting leaves would have less than the minimum number of samples required to be at a leaf node.
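A minimal sketch of that rule, using a hypothetical helper function (split_allowed is my own illustration, not part of scikit-learn): a split is only kept if the node has at least min_samples_split samples and both resulting children meet min_samples_leaf.

# Hypothetical helper mirroring the rule described above; not a scikit-learn API.
def split_allowed(n_node, n_left, n_right,
                  min_samples_split=5, min_samples_leaf=2):
    return (n_node >= min_samples_split        # node is large enough to split
            and n_left >= min_samples_leaf     # left child is a valid leaf
            and n_right >= min_samples_leaf)   # right child is a valid leaf

print(split_allowed(7, 1, 6))  # False: the 1-sample child violates min_samples_leaf
print(split_allowed(7, 3, 4))  # True: both children have at least 2 samples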
As the documentation referenced above mentions, min_samples_leaf guarantees a minimum number of samples in every leaf, no matter the value of min_samples_split.
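You can verify that guarantee on a fitted tree. A small sketch, assuming toy data from make_classification (my own addition, not from the post): with min_samples_split alone, leaves can still end up very small, while adding min_samples_leaf = 2 bounds every leaf's size.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# min_samples_split alone: children of a split may be arbitrarily small.
loose = DecisionTreeClassifier(min_samples_split=5, random_state=0).fit(X, y)
# Same split rule plus min_samples_leaf=2: every leaf holds at least 2 samples.
strict = DecisionTreeClassifier(min_samples_split=5, min_samples_leaf=2,
                                random_state=0).fit(X, y)

for name, clf in [("split only", loose), ("split + leaf", strict)]:
    t = clf.tree_
    # Leaves are the nodes with no children (children_left == -1).
    leaf_sizes = t.n_node_samples[t.children_left == -1]
    print(name, "smallest leaf:", leaf_sizes.min())

On data like this, the "split only" tree will typically report a smallest leaf of 1, while the "split + leaf" tree always reports at least 2.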
Answered By - Alex