Issue
I am trying to plot SHAP
This is my code rnd_clf
is a RandomForestClassifier
:
import shap
explainer = shap.TreeExplainer(rnd_clf)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values[1], X)
I understand that shap_values[0]
is negative and shap_values[1]
is positive.
But what about for multiple class RandomForestClassifier? I have the rnd_clf
classifying one of:
['Gusto','Kestrel 200 SCI Older Road Bike', 'Vilano Aluminum Road Bike 21 Speed Shimano', 'Fixie'].
How do I determine which index of shap_values[i]
corresponds to which class of my output?
Solution
How do I determine which index of shap_values[i] corresponds to which class of my output?
shap_values[i]
are SHAP values for i'th class. What is an i'th class is more a question of an encoding schema you use: LabelEncoder
, pd.factorize
, etc.
You may try the following as a clue:
from sklearn.preprocessing import LabelEncoder
labels = [
"Gusto",
"Kestrel 200 SCI Older Road Bike",
"Vilano Aluminum Road Bike 21 Speed Shimano",
"Fixie",
]
le = LabelEncoder()
y = le.fit_transform(labels)
encoding_scheme = dict(zip(y, labels))
pprint(encoding_scheme)
{0: 'Fixie',
1: 'Gusto',
2: 'Kestrel 200 SCI Older Road Bike',
3: 'Vilano Aluminum Road Bike 21 Speed Shimano'}
So, eg shap_values[3]
for this particular case is for 'Vilano Aluminum Road Bike 21 Speed Shimano'
To further understand how to interpret SHAP values let's prepare a synthetic dataset for multiclass classification with 100 features and 10 classes:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from shap import TreeExplainer
from shap import summary_plot
X, y = make_classification(1000, 100, n_informative=8, n_classes=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape)
(750, 100)
At this point we have train dataset with 750 rows, 100 features, and 10 classes.
Let's train RandomForestClassifier
and feed it to TreeExplainer
:
clf = RandomForestClassifier(n_estimators=100, max_depth=3)
clf.fit(X_train, y_train)
explainer = TreeExplainer(clf)
shap_values = np.array(explainer.shap_values(X_train))
print(shap_values.shape)
(10, 750, 100)
10 : number of classes. All SHAP values are organized into 10 arrays, 1 array per class.
750 : number of datapoints. We have local SHAP values per datapoint.
100 : number of features. We have SHAP value per every feature.
For example, for Class 3
you'll have:
print(shap_values[3].shape)
(750, 100)
750: SHAP values for every datapoint
100: SHAP value contributions for every feature
Finally, you can run a sanity check to make it sure real predictions from model are the same as those predicted by shap
.
To do so, we'll (1) swap the first 2 dimensions of shap_values
, (2) sum up SHAP values per class for all features, (3) add SHAP values to base values:
shap_values_ = shap_values.transpose((1,0,2))
np.allclose(
clf.predict_proba(X_train),
shap_values_.sum(2) + explainer.expected_value
)
True
Then you may proceed to summary_plot
that will show feature rankings based on SHAP values on a per class basis. For class 3 this will be:
summary_plot(shap_values[3],X_train)
Which is interpreted as follows:
For class 3 most influential features based on SHAP contributions are 16,59,24
For feature 15 lower values tend to result in higher SHAP values (hence higher probability of the class label)
Features 50, 45, 48 are least influential out of 20 displayed
Answered By - Sergey Bushmanov
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.