Issue
I am working on an NLP classification task. My dataset is imbalanced, and some authors have only one text, so I want those texts to appear only in the training set. The texts of the remaining authors should be split 70% / 15% / 15% into training, validation, and test sets.
I tried using the train_test_split function from sklearn, but the results aren't good.
My dataset is a dataframe with the following columns:
Title | Preprocessed_Text | Label
Please let me know.
Solution
It is rather hard to obtain good classification results for a class that contains only one instance. Regardless, for imbalanced datasets you should use a stratified train_test_split (by passing stratify=y), which preserves in each split the same class proportions observed in the original dataset.
from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
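The question also asks for a 70/15/15 split with single-text authors kept in training. A minimal sketch of one way to do that, assuming a DataFrame with a Label column as described in the question (the function name and seed are my own; classes with only two texts may still end up absent from one of the smaller splits):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_70_15_15(df, label_col="Label", seed=42):
    # Authors with a single text cannot be stratified; force them into training.
    counts = df[label_col].value_counts()
    singletons = df[df[label_col].isin(counts[counts == 1].index)]
    rest = df[df[label_col].isin(counts[counts > 1].index)]

    # First carve off 30% for validation + test, stratified by author.
    train, temp = train_test_split(
        rest, test_size=0.30, stratify=rest[label_col], random_state=seed
    )
    # Split that 30% in half -> 15% validation, 15% test.
    # (Stratifying again can fail when a class has only one row in temp,
    #  so this second split is left unstratified.)
    val, test = train_test_split(temp, test_size=0.50, random_state=seed)

    train = pd.concat([train, singletons])
    return train, val, test
```

The single-instance texts are appended to the training set at the end, so every author is seen during training even though the model cannot be meaningfully evaluated on them.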
I should also add that if the dataset is rather small, say no more than 100 instances, it is preferable to use cross-validation instead of train_test_split, and more specifically StratifiedKFold or RepeatedStratifiedKFold, which return stratified folds (see this answer for the difference between the two).
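A short sketch of both cross-validators, using synthetic imbalanced data as a stand-in for the text features (an assumption; in practice X would be e.g. TF-IDF vectors and y the author labels, and LogisticRegression is just a placeholder classifier):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold, RepeatedStratifiedKFold, cross_val_score,
)

# Synthetic imbalanced binary dataset (~80% majority class).
X, y = make_classification(n_samples=100, weights=[0.8], random_state=0)

clf = LogisticRegression(max_iter=1000)

# Stratified 5-fold: every fold preserves the original class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_weighted")

# RepeatedStratifiedKFold reruns the whole procedure with fresh shuffles,
# which gives a more stable estimate on small datasets.
rcv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
rscores = cross_val_score(clf, X, y, cv=rcv, scoring="f1_weighted")
```

With 5 splits and 3 repeats, the repeated variant yields 15 scores whose mean and standard deviation summarize model performance.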
When it comes to evaluation, you should consider metrics such as Precision, Recall, and F1-score (the harmonic mean of Precision and Recall), using the weighted average for each: it weights every class's score by its number of true instances. As per the documentation:
'weighted':
Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
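To make the weighted averaging concrete, here is a toy illustration with made-up predictions for an imbalanced 3-class problem (the labels and predictions are invented for the example):

```python
from sklearn.metrics import classification_report, f1_score

# Toy imbalanced ground truth: 6 instances of class 0, 2 of class 1, 1 of class 2.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 2]

# average="weighted" weights each per-class F1 by that class's support.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

# classification_report prints precision, recall, and F1 per class,
# plus the macro and weighted averages.
print(classification_report(y_true, y_pred, zero_division=0))
```

Because class 0 has six of the nine true instances, its per-class F1 dominates the weighted average, which is exactly the behavior the quoted documentation describes.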
Answered By - Chris