Issue
I have a highly imbalanced dataset with 3 classes. To address this, I applied the sample_weight array in XGBClassifier, but I'm not noticing any changes in the modelling results. All of the metrics in the classification report (confusion matrix) are the same. Is there an issue with the implementation?
The class ratios:
military: 1171
government: 34852
other: 20869
Example:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=process_text)),  # convert strings to integer counts
    ('tfidf', TfidfTransformer()),  # convert integer counts to weighted TF-IDF scores
    ('classifier', XGBClassifier(sample_weight=compute_sample_weight(class_weight='balanced', y=y_train)))  # train an XGBoost classifier on the TF-IDF vectors
])
Sample of Dataset:
import pandas as pd

data = pd.DataFrame({'entity_name': ['UNICEF', 'US Military', 'Ryan Miller'],
                     'class': ['government', 'military', 'other']})
Classification Report: (screenshot omitted from the post)
Solution
- First, most important: use a multiclass eval_metric: eval_metric='merror' or eval_metric='mlogloss', then post us the results. You showed us ['precision','recall','f1-score','support'], but that's suboptimal, or outright broken, unless you computed them in a multiclass-aware, imbalance-aware way.
- Second, you need weights. Your class ratio military : government : other is 1 : 30 : 18, or as percentages roughly 2% : 61% : 37%. That's severely imbalanced.
- You can manually set per-class weights with the weight argument of xgb.DMatrix(..., weight=...) (see the sketches after this list).
- Look inside your pipeline (use print or verbose settings, dump values); don't just blindly rely on boilerplate like sklearn.utils.class_weight.compute_sample_weight('balanced', ...) to give you optimal weights.
- Experiment with manually setting per-class weights, starting with 1 : 1/30 : 1/18, and try more extreme values. Use reciprocals so the rarest class gets the highest weight.
- Also try setting min_child_weight much higher, so that a split requires at least a few exemplars of the minority classes. Start with min_child_weight >= 2 * (weight of the rarest class) and try going higher. Beware of overfitting to the very rare minority class; this is why people use StratifiedKFold cross-validation for some protection, but your code isn't using CV (a sketch follows the list).
- We can't see your other xgboost parameters (how many estimators? early stopping on or off? what learning_rate/eta? etc.). It looks like you used the defaults, and they'll be terrible; or else you're not showing your code. Distrust xgboost's defaults, especially for multiclass, and don't expect xgboost to give good results out of the box. Read the docs and experiment with values.
- Do all that experimentation and post your results; check before concluding "it doesn't work". Don't expect optimal results out of the box. Distrust or double-check the sklearn utility functions and try manual alternatives. (Just because sklearn has a function to do something doesn't mean it's good, best, or suitable for all use-cases, like imbalanced multiclass.)
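A minimal sketch of the first two bullets, wiring the weights through the pipeline. Note that in the scikit-learn API, sample_weight is a parameter of fit(), not of the XGBClassifier constructor, so a weight array passed at construction (as in the question) is most likely never applied, which would explain the unchanged metrics. process_text and y_train come from the question; X_train and everything else here are illustrative assumptions.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=process_text)),
    ('tfidf', TfidfTransformer()),
    # set a multiclass eval_metric explicitly instead of trusting the default
    ('classifier', XGBClassifier(eval_metric='mlogloss')),
])

# one weight per training sample, computed from class frequencies;
# y_train is assumed integer-encoded (0, 1, 2), since recent xgboost
# versions reject string labels
weights = compute_sample_weight(class_weight='balanced', y=y_train)
print(weights[:10])  # look inside; don't trust the boilerplate blindly

# a Pipeline forwards fit parameters to a step via '<step>__<param>'
pipeline.fit(X_train, y_train, classifier__sample_weight=weights)

Afterwards, report macro-averaged scores (e.g. classification_report, or scoring='f1_macro'), since accuracy-style numbers are misleading at a 2% : 61% : 37% split.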
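For the manual 1 : 1/30 : 1/18 starting point, a hedged sketch that expands per-class weights into per-sample weights; y_train_text is assumed to hold the original string labels:

import numpy as np
from sklearn.preprocessing import LabelEncoder

# reciprocals of the class ratio: the rarest class gets the highest weight
class_to_weight = {'military': 1.0, 'government': 1 / 30, 'other': 1 / 18}
manual_weights = np.array([class_to_weight[label] for label in y_train_text])

# integer-encode the targets for xgboost
y_enc = LabelEncoder().fit_transform(y_train_text)
pipeline.fit(X_train, y_enc, classifier__sample_weight=manual_weights)

From there, push the weights to more extreme values than the plain reciprocals and compare the per-class recall.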
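The same weights in the native API, where xgb.DMatrix takes a per-row weight argument; X_tfidf (an already-vectorised training matrix) and the parameter values are illustrative assumptions, not recommendations:

import xgboost as xgb

dtrain = xgb.DMatrix(X_tfidf, label=y_enc, weight=manual_weights)
params = {
    'objective': 'multi:softprob',  # multiclass probabilities
    'num_class': 3,
    'eval_metric': 'mlogloss',      # multiclass metric, per the first bullet
    'eta': 0.1,                     # illustrative; tune, don't trust defaults
}
booster = xgb.train(params, dtrain, num_boost_round=200)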
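Finally, a sketch of the StratifiedKFold protection mentioned above, combined with a small search over min_child_weight and the other parameters the answer asks about; the grid values are illustrative only:

from sklearn.model_selection import GridSearchCV, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    pipeline,
    param_grid={
        'classifier__min_child_weight': [1, 2, 5, 10],
        'classifier__learning_rate': [0.05, 0.1, 0.3],
        'classifier__n_estimators': [100, 300],
    },
    scoring='f1_macro',  # imbalance-aware multiclass score
    cv=cv,
)
search.fit(X_train, y_enc)
print(search.best_params_, search.best_score_)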
Answered By - smci