Issue
I have been constructing my own Extra Trees (XT) classifier in Rust for binary classification. To verify its correctness, I have been comparing it against Sklearn's implementation of XT, but I consistently get different results. At first I thought there must be a bug in my code, but I now realize it is not a bug; rather, it is a different method of tallying votes amongst the trees in the ensemble. In my code, each tree votes for the most frequent class in a leaf's subset of training data. For example, if we are traversing a tree and find ourselves at a leaf node that has 40 samples of class 0 and 60 samples of class 1, the tree classifies the data as class 1.
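In Python terms, a rough sketch of my scheme (not my actual Rust code; the counts are made up) looks like this:

import numpy as np

# Made-up leaf class counts for one sample across 3 trees: [count of 0s, count of 1s].
leaf_counts = np.array([[40, 60],
                        [10, 90],
                        [55, 45]])

# Each tree casts one hard vote for its leaf's majority class...
votes = leaf_counts.argmax(axis=1)    # -> [1, 1, 0]

# ...and the ensemble prediction is the plurality of those votes.
print(np.bincount(votes).argmax())    # -> 1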
Looking at Sklearn's documentation for XT (as seen here), I read the following line regarding the predict method:
The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability estimates. That is, the predicted class is the one with highest mean probability estimate across the trees.
While this gives me some idea about how individual trees vote, I still have more questions. Perhaps an exact mathematical expression of how these weights are calculated would help, but I have yet to find one in the documentation.
I will provide more details in the upcoming paragraphs, but I wish to ask my question concisely here: how are these weights calculated at a high level, and what is the mathematics behind them? Is there a way to change how individual XT trees calculate their votes?
Additional Details
For my current tests, this is how I build my classifier:

from sklearn.ensemble import ExtraTreesClassifier

classifier = ExtraTreesClassifier(n_estimators=5, criterion='gini',
                                  max_depth=1, max_features=5,
                                  random_state=0)
To predict unseen transactions X, I use classifier.predict(X). Digging through the source code of predict (seen here, around line 630), I see that this is all the code that executes for binary classification:
proba = self.predict_proba(X)

if self.n_outputs_ == 1:
    return self.classes_.take(np.argmax(proba, axis=1), axis=0)
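To make that step concrete to myself, here is a minimal sketch with made-up probabilities (the values are not from a real model):

import numpy as np

classes_ = np.array([0, 1])
proba = np.array([[0.7, 0.3],    # hypothetical averaged probabilities
                  [0.2, 0.8],    # for three transactions
                  [0.5, 0.5]])

# argmax picks the column (class index) with the highest probability in each row,
# and take maps those indices back onto the actual class labels.
print(classes_.take(np.argmax(proba, axis=1), axis=0))    # -> [0 1 0]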
What this code is doing is relatively obvious to me: it merely determines the most likely class for each transaction by taking the argmax of proba. What I fail to understand is how this proba value is computed in the first place. I believe that the predict_proba method that predict uses is defined here, around line 650. Here is what I believe the relevant source code to be:
check_is_fitted(self)
# Check data
X = self._validate_X_predict(X)

# Assign chunk of trees to jobs
n_jobs, _, _ = _partition_estimators(self.n_estimators, self.n_jobs)

# avoid storing the output of every estimator by summing them here
all_proba = [np.zeros((X.shape[0], j), dtype=np.float64)
             for j in np.atleast_1d(self.n_classes_)]
lock = threading.Lock()
Parallel(n_jobs=n_jobs, verbose=self.verbose,
         **_joblib_parallel_args(require="sharedmem"))(
    delayed(_accumulate_prediction)(e.predict_proba, X, all_proba, lock)
    for e in self.estimators_)

for proba in all_proba:
    proba /= len(self.estimators_)

if len(all_proba) == 1:
    return all_proba[0]
else:
    return all_proba
I fail to understand what exactly is being calculated here. This is where my trail goes cold, and I find myself in need of help.
Solution
Trees can predict probability estimates, according to the training-sample proportions in each leaf. In your example, the predicted probability of class 0 is 0.4, and that of class 1 is 0.6.
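You can see this directly from a single fitted sklearn tree, whose predict_proba returns exactly those leaf proportions (a minimal sketch on made-up toy data):

import numpy as np
from sklearn.tree import ExtraTreeClassifier

# Toy data: one noisy feature, so a depth-1 tree ends up with mixed leaves.
rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = (X[:, 0] + 0.3 * rng.rand(100) > 0.5).astype(int)

tree = ExtraTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# Each row is the class proportion of the leaf that the sample falls into.
print(tree.predict_proba(X[:3]))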
Random forests and extremely randomized trees in sklearn perform soft voting: each tree predicts the class probabilities as above, and the ensemble simply averages those across trees. That produces a mean probability for each class, and the predicted class is the one with the largest mean probability. In symbols, with T trees the ensemble estimates p(c | x) = (1/T) * sum_t p_t(c | x), where p_t(c | x) is the class-c proportion in the leaf of tree t that x lands in, and it predicts argmax_c p(c | x). The "weights" mentioned in the documentation are just these per-tree probability estimates; there is no additional per-tree weighting.
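A small numeric sketch (the numbers are invented) showing how soft voting can disagree with your hard majority vote:

import numpy as np

# Made-up per-tree probability estimates for one sample, classes [0, 1].
per_tree = np.array([[0.4, 0.6],   # tree 1 votes 1 under hard voting
                     [0.9, 0.1],   # tree 2 votes 0
                     [0.3, 0.7]])  # tree 3 votes 1

mean_proba = per_tree.mean(axis=0)    # -> [0.533..., 0.466...]
print(mean_proba.argmax())            # soft vote -> class 0,
                                      # even though 2 of 3 trees prefer class 1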
In the code, the relevant bit is _accumulate_prediction, which just sums the per-tree probability estimates; the loop that follows divides by the number of estimators.
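If you want to verify this against your own implementation, the ensemble's predict_proba can be reproduced by hand from the fitted trees (a sketch; the training data here is made up, while estimators_, predict_proba, and classes_ are the real sklearn attributes):

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Made-up stand-ins for the real training and test data.
rng = np.random.RandomState(0)
X_train = rng.rand(200, 5)
y_train = (X_train.sum(axis=1) > 2.5).astype(int)
X_test = rng.rand(10, 5)

clf = ExtraTreesClassifier(n_estimators=5, criterion='gini', max_depth=1,
                           max_features=5, random_state=0).fit(X_train, y_train)

# Sum each tree's leaf-proportion estimates, then divide by the number of trees,
# mirroring _accumulate_prediction followed by the final division.
manual = sum(tree.predict_proba(X_test) for tree in clf.estimators_)
manual /= len(clf.estimators_)

print(np.allclose(manual, clf.predict_proba(X_test)))         # -> True
print(clf.classes_.take(np.argmax(manual, axis=1), axis=0))   # same as clf.predict(X_test)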
Answered By - Ben Reiniger