Issue
Suppose I have a continuous predictor variable with values of 10, 20, 20, 30. I understand that the set of potential split thresholds would include {15, 25}, as these are the means of 10 & 20 and of 20 & 30, respectively. But would 20 also be included as a potential split threshold because it is the mean of 20 & 20, or do repeated values in the sorted array get skipped?
Note that I'm not asking about the metric used to select the best split threshold (gini, entropy, log-loss, etc.). I'm asking about the upstream process of identifying the potential thresholds that will be evaluated with this metric.
My coding skills aren't strong enough to understand the scikit-learn source code, but I think this information might be found here. I cannot find anything in the documentation itself about this though.
Solution
No, in your example 20 is not considered a candidate split point. Since the splits are taken as f_i <= threshold vs f_i > threshold, a threshold of 20 and a threshold of 25 produce the same split anyway: both send 10, 20, 20 to the left and 30 to the right.
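To make that equivalence concrete, here is a small sketch in plain Python. The partition helper is hypothetical, written only for illustration, and is not part of scikit-learn; it just applies the <= / > rule described above to the values from the question.

```python
# Values of the predictor from the question.
values = [10, 20, 20, 30]


def partition(values, threshold):
    """Split values into (left, right) using the rule value <= threshold."""
    left = [v for v in values if v <= threshold]
    right = [v for v in values if v > threshold]
    return left, right


# Both thresholds put 10, 20, 20 on the left and 30 on the right.
same_partition = partition(values, 20) == partition(values, 25)
print(partition(values, 20))
print(partition(values, 25))
print(same_partition)
```

Any threshold in the interval [20, 30) yields this same partition, so evaluating 20 in addition to 25 would not give the splitter any new information.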
In the code that you linked (I'm looking at BestSplitter), after sorting the feature values, it loops through the indices p, but skips over those with equal values:
while p + 1 < end and Xf[p + 1] <= Xf[p] + FEATURE_THRESHOLD:
    p += 1
[source] (FEATURE_THRESHOLD is very small and handles precision issues.)
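The skipping logic can be reproduced in a simplified pure-Python sketch. This is not the scikit-learn implementation: the function name candidate_thresholds is made up for illustration, the value 1e-7 for FEATURE_THRESHOLD is an assumption, the midpoint rule is taken from the question, and other checks a real splitter applies (such as minimum samples per leaf) are left out.

```python
# Assumed tolerance for treating two feature values as equal.
FEATURE_THRESHOLD = 1e-7


def candidate_thresholds(feature_values, eps=FEATURE_THRESHOLD):
    """Return midpoints between consecutive distinct sorted values."""
    xf = sorted(feature_values)
    end = len(xf)
    thresholds = []
    p = 0
    while p < end:
        # Skip forward while the next value is (almost) equal to the current one.
        while p + 1 < end and xf[p + 1] <= xf[p] + eps:
            p += 1
        p += 1
        if p < end:
            # xf[p - 1] and xf[p] now differ, so their midpoint is a candidate.
            thresholds.append(xf[p - 1] / 2.0 + xf[p] / 2.0)
    return thresholds


print(candidate_thresholds([10, 20, 20, 30]))  # [15.0, 25.0]
```

With the values from the question, the repeated 20 is stepped over, so only 15 and 25 come out as candidates.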
Answered By - Ben Reiniger