Monday, January 10, 2022

[FIXED] With MLPCLassifier, using partial_fit several times yields worst accuracy than using fit(), despite having shuffled data

January 10, 2022 machine-learning, neural-network, python, scikit-learn No comments

Issue

I'm using sklearn's MLPClassifier to build a neural network in Python for a classification task. I would like to plot a curve of the accuracy against the number of epochs, to see how many epochs I need to have some level of accuracy. The only way I've been able to do this is to use partial_fit() in a loop. Here is the code the does this:

from sklearn.preprocessing   import StandardScaler
from sklearn.decomposition   import PCA
from sklearn.neural_network  import MLPClassifier
import pandas as pd
import numpy  as np
import matplotlib.pyplot as plt

scaler = StandardScaler()
scaler.fit(df_train_sample)
X_train = scaler.transform(df_train_sample)
scaler.fit(df_val)
X_val = scaler.transform(df_val)

pca = PCA(pca_frac)
pca.fit(X_train)
X_train = pca.transform(X_train)
X_val = pca.transform(X_val)

n_classes = np.unique(labels_train_sample)
n_train_sample = len(df_train_sample)

scores_train = []
scores_val = []

epoch = 0
while epoch < max_iter:
   
    random_perm = np.random.permutation(n_train_sample)
    mini_batch_index = 0

    while True:
        indices = random_perm[mini_batch_index:mini_batch_index + batch_size]
        mlpc.partial_fit(X_train[indices], labels_train_sample[indices], classes=n_classes)
        mini_batch_index += batch_size

        if mini_batch_index >= n_train_sample:
            break
    
    scores_train.append(mlpc.score(X_train, labels_train_sample))
    scores_val.append(mlpc.score(X_val, labels_val))

    epoch += 1

fig, ax = plt.subplots()

ax.plot(np.arange(1, max_iter + 1), scores_train, label = "Train")
ax.plot(np.arange(1, max_iter + 1), scores_val, label = "Validation")

Here, max_iter is the number of epochs, and mlpc is the classifier, defined as follows:

seed          = 123
hidden_layers = [30, 15]
activation    = "relu"
learning_rate = 5e-4
beta_1        = 0.99
epsilon       = 1e-4

batch_size    = 200 
max_iter      = 200 
tol           = 1e-4

warm_start    = True
shuffle       = True

mlpc = MLPClassifier(
    hidden_layer_sizes = hidden_layers,
    activation         = activation,
    batch_size         = batch_size,
    learning_rate_init = learning_rate,
    beta_1             = beta_1,
    epsilon            = epsilon,
    warm_start         = warm_start,
    shuffle            = shuffle,
    max_iter           = max_iter,
    tol                = tol,
    random_state       = seed
)

Just to be sure, here is how df_train_sample and labels_train_sample are constructed from the original dataframe:

df_train_sample = df_train.sample(N, replace = False).reset_index(drop = True)
labels_train_sample = labels_train[df_train_sample.index].reset_index(drop = True)

where N is the number of rows to sample. df_val and labels_val are the validation data, and are read directly from a .csv file without modifications. Note that the labels are booleans.

The problem is that the algorithm, if called with mlpc.fit(), yields an accuracy of about 82% on the sampled dataset, while the accuracy of the piece of code I've posted is 65%. Here is the plot:

Searching online I've found that shuffling the data can help, but as you can see the data is already shuffled every epoch. Why is this the case? Is there another way of building said plot in another, more straightforward way?

Solution

I've found out what is the problem. The problem is not in partial_fit(), but in the way I've built the sample dataframes:

df_train_sample = df_train.sample(N, replace = False).reset_index(drop = True)
labels_train_sample = labels_train[df_train_sample.index].reset_index(drop = True)

In this part I reset the index of df_train_sample as I build it, but then I use its index to sample the corresponding rows from labels_train. This would work if I didn't reset the index (which is what I used to do in a previous version).

The solution is simply to store the index before resetting it, like this

df_train_sample = df_train.sample(N, replace = False)
train_index = df_train_sample.index
df_train_sample = df_train_sample.reset_index(drop = True)
labels_train_sample = labels_train[train_index].reset_index(drop = True)

Answered By - ultrapoci

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, January 10, 2022

[FIXED] With MLPCLassifier, using partial_fit several times yields worst accuracy than using fit(), despite having shuffled data

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels