Issue
I am attempting to see how well I can classify books by genre using TF-IDF vectorization. I am using five moderately imbalanced genre labels, and I want to use multilabel classification to assign each document one or more genres. Initially my performance was middling, so I tried to fix this by re-balancing the classes with RandomOverSampler, and my cross-validated f1_macro score shot up from 0.415 to 0.842.
I have read here that improperly combining resampling with cross-validation can cause the model to overfit, so I want to make sure I'm not doing that here.
import re

import nltk
import pandas as pd

def preprocess_text(text):
    try:
        # keep letters only, lowercase, and split on whitespace
        text = re.sub('[^a-zA-Z]', ' ', text)
        text = text.lower().split()
        # drop stopwords, then lemmatize the remaining words
        stop_words = set(nltk.corpus.stopwords.words('english'))
        lemmatizer = nltk.stem.WordNetLemmatizer()
        text = [word for word in text if word not in stop_words]
        text = [lemmatizer.lemmatize(word) for word in text if len(word) > 1]
        return ' '.join(text)
    except TypeError:
        # non-string entries (e.g. NaN) become empty documents
        return ''

def preprocess_series(series):
    return pd.Series([preprocess_text(text) for text in series])
from imblearn.over_sampling import RandomOverSampler
from sklearn import compose, multiclass, pipeline, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

books_data = pd.DataFrame([
    ["A_Likely_Story.txt", "fantasy fiction:science fiction", "If you discovered a fantastic power like thi..."],
    ["All_Cats_Are_Gray.txt", "science fiction", "An odd story, made up of oddly assorted elem..."]
], columns=["title", "genre", "text"])

X = pd.DataFrame(preprocess_series(books_data["text"]), columns=["text"])
# note: this keeps only the first of each colon-separated genre list
Y = pd.Series([genres.split(":")[0] for genres in books_data["genre"]])

# oversample the full dataset before any train/test splitting
oversampler = RandomOverSampler()
x_ros, y_ros = oversampler.fit_resample(X, Y)

column_trans = compose.make_column_transformer(
    (TfidfVectorizer(ngram_range=(1, 3)), "text")
)
ovr_svc_clf = multiclass.OneVsRestClassifier(svm.LinearSVC())
pipe = pipeline.make_pipeline(column_trans, ovr_svc_clf)

# score on the original data, then on the oversampled data
print(cross_val_score(pipe, X, Y, cv=3, scoring="f1_macro").mean())
print(cross_val_score(pipe, x_ros, y_ros, cv=3, scoring="f1_macro").mean())
Here is the distribution of my class labels. Is it small and imbalanced enough to cause overfitting?
Solution
Oversampling itself doesn't cause overfitting.
Oversampling before splitting for cross-validation causes data leakage, and the scores you're seeing are indeed not usable as estimates of future performance: your test folds (very likely) contain copies of the same data points that appear in the training folds.
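You can see the leak directly: after fit_resample, split the resampled data the way cross-validation would and count the texts that land on both sides. A minimal sketch, assuming the x_ros DataFrame from the question:

from sklearn.model_selection import KFold

# any overlap here means the "test" folds contain copies of training rows
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(x_ros):
    train_texts = set(x_ros.iloc[train_idx]["text"])
    test_texts = set(x_ros.iloc[test_idx]["text"])
    print(len(train_texts & test_texts), "texts appear in both train and test")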
You can alleviate this by making the oversampling the first step of the pipeline itself (using the imblearn version of a pipeline, if you aren't already), as sketched below.
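A minimal sketch of that fix, reusing the column_trans and ovr_svc_clf objects defined in the question. imblearn's pipeline applies samplers only during fitting, so each test fold is left untouched:

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline as make_imb_pipeline
from sklearn.model_selection import cross_val_score

# the sampler now runs inside each training fold only, so the test folds
# never contain oversampled copies of training points
imb_pipe = make_imb_pipeline(RandomOverSampler(), column_trans, ovr_svc_clf)
print(cross_val_score(imb_pipe, X, Y, cv=3, scoring="f1_macro").mean())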
All that said, try modeling without balancing at all: tune a custom decision threshold, or evaluate with a threshold-independent metric such as ROC AUC or average precision.
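For the custom-threshold route, a hedged sketch using out-of-fold decision scores; the -0.25 value is purely illustrative, and in practice you would tune one threshold per class on validation data:

from sklearn.model_selection import cross_val_predict

# out-of-fold SVM decision scores, shape (n_samples, n_classes) under OvR
scores = cross_val_predict(pipe, X, Y, cv=3, method="decision_function")

# LinearSVC's implicit threshold is 0; lowering it makes the classifier
# assign minority genres more readily, one threshold per class if needed
y_pred_multilabel = scores > -0.25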
Answered By - Ben Reiniger