Issue
I am trying to put together an ML pipeline in Python (using scikit-learn, though I am open to alternative package suggestions). I have 5 categorical feature variables, 2 continuous feature variables, and an ordinal target variable with the following value counts:
0.0 35063
1.0 1073
2.0 496
3.0 52
4.0 13
5.0 4
6.0 2
As you might have already caught, the trick here is that roughly 95% of the target variable consists of the 0.0 label. I have put together a pipeline where I OneHotEncode the categorical feature variables and StandardScale the continuous feature variables:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

categorical_transformer = OneHotEncoder(handle_unknown='ignore')
continuous_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('num', continuous_transformer, continuous_features)
    ])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
And later applying the following split:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Using sklearn.metrics' accuracy_score, I appear to achieve a 94% overall model accuracy, which sounds great. However, given the skew in the target variable, I am worried this model is prone to fitting problems. I would really appreciate some insight here.
Thanks all!
Solution
Consider the following points:
Class imbalance: A naïve classifier that always predicts the majority class would be correct 95.5% of the time, so your reported 94% accuracy is actually below that naïve baseline. Explore methods to manage the target class imbalance, such as undersampling, oversampling, or class weighting.
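To make that baseline concrete, scikit-learn ships a DummyClassifier that always predicts the most frequent class. A minimal sketch on a toy target mimicking the question's skew (the data here is synthetic, not the asker's):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy ordinal target with the same kind of skew as in the question: 95% zeros.
y = np.repeat([0.0, 1.0, 2.0], [950, 40, 10])
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
print(baseline.score(X, y))  # accuracy of always predicting 0.0: 0.95
```

Any real model should be judged against this number, not against 0%.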
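Oversampling can be done without extra dependencies via sklearn.utils.resample; the sketch below (synthetic two-class data, an assumption for illustration) upsamples the minority class with replacement until it matches the majority count. The imbalanced-learn package offers richer tooling (RandomOverSampler, SMOTE, undersamplers) if you want more than random resampling.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(42)
X = rng.rand(1000, 2)
y = np.where(rng.rand(1000) < 0.95, 0, 1)  # heavily skewed binary target

X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# Draw minority samples with replacement until they match the majority count.
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
```

Only resample the training split, never the test split, or your evaluation will be optimistic.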
Classifier for an ordinal target: RandomForestClassifier does not account for the ordinal nature of the target variable. For algorithms better suited to ordinal targets, refer to this discussion: Multi-class, multi-label, ordinal classification with sklearn.
Metric: As indicated, accuracy_score may not be the optimal metric for your scenario. A high accuracy_score does not guarantee a useful classifier, and it disregards the ordinal nature of your target variable. For example, accuracy_score treats predicting 0.0 instead of 6.0 the same as predicting 5.0 instead of 6.0. Investigate metrics that more accurately reflect the cost of misclassifying an ordinal target: Measures of ordinal classification error for ordinal regression.
Splitting: When dealing with imbalanced data you want stratified splits, but note that train_test_split does NOT stratify by default; pass stratify=y so that each split contains roughly the same proportion of each class. When implementing cross-validation, make sure to use a stratified approach as well, for example with StratifiedKFold.
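Putting the splitting and metric points together, here is a sketch on synthetic skewed ordinal data (class_weight='balanced' is one assumed way to handle the imbalance, not the only one). It stratifies the split explicitly and evaluates with balanced_accuracy_score plus mean_absolute_error on the integer labels, the latter penalising predictions in proportion to how far they fall from the true ordinal value:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(600, 2)
y = np.repeat([0, 1, 2], [500, 80, 20])  # skewed ordinal target

# train_test_split does NOT stratify unless you ask it to.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Balanced accuracy averages recall over all classes; MAE on the labels
# measures how far off each ordinal prediction is, not just whether it is wrong.
print(balanced_accuracy_score(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))
```

With stratify=y, each class keeps its proportion in both splits, so rare classes are guaranteed to appear in the test set.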
With a little bit of research on these points I am certain you will find a good solution to your problem.
Answered By - DataJanitor