Issue
Thank you for @Prune's critical comments on my questions.
I am trying to find the relationship between batch size and training time using the MNIST dataset.
From reading numerous questions on Stack Overflow, such as "How does batch size impact time execution in neural networks?", I understood that training time decreases when I use a small batch size.
However, when I tried both extremes, I found that training with batch size == 1 takes far more time than with batch size == 60,000. I set the number of epochs to 10.
I split the MNIST dataset into 60k images for training and 10k for testing.
Below are my code and results.
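import torch
import torchvision
from torchvision import transforms
from timeit import default_timer as timer

# Assumed download location; the original post does not show how root_dir is set.
root_dir = "./data"

mnist_trainset = torchvision.datasets.MNIST(root=root_dir,
                                            train=True,
                                            download=True,
                                            transform=transforms.Compose([transforms.ToTensor()]))
mnist_testset = torchvision.datasets.MNIST(root=root_dir,
                                           train=False,
                                           download=True,
                                           transform=transforms.Compose([transforms.ToTensor()]))

# One sample per training batch: 60,000 optimizer steps per epoch
train_dataloader = torch.utils.data.DataLoader(mnist_trainset,
                                               batch_size=1,
                                               shuffle=True)
test_dataloader = torch.utils.data.DataLoader(mnist_testset,
                                              batch_size=50,
                                              shuffle=False)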
# Define the model
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear_1 = torch.nn.Linear(784, 256)
        self.linear_2 = torch.nn.Linear(256, 10)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        x = x.reshape(x.size(0), -1)
        x = self.linear_1(x)
        x = self.sigmoid(x)
        pred = self.linear_2(x)
        return pred
# trainer
no_epochs = 10

def my_trainer(optimizer, model):
    criterion = torch.nn.CrossEntropyLoss()
    train_loss = list()
    test_loss = list()
    test_acc = list()
    best_test_loss = 1
    for epoch in range(no_epochs):
        # timer starts
        start = timer()
        total_train_loss = 0
        total_test_loss = 0
        # training
        # set up training mode
        model.train()
        for itr, (image, label) in enumerate(train_dataloader):
            optimizer.zero_grad()
            pred = model(image)
            loss = criterion(pred, label)
            total_train_loss += loss.item()
            loss.backward()
            optimizer.step()
        total_train_loss = total_train_loss / (itr + 1)
        train_loss.append(total_train_loss)
        # testing
        # change to evaluation mode
        model.eval()
        total = 0
        for itr, (image, label) in enumerate(test_dataloader):
            pred = model(image)
            loss = criterion(pred, label)
            total_test_loss += loss.item()
            # we now need softmax because we are testing.
            pred = torch.nn.functional.softmax(pred, dim=1)
            for i, p in enumerate(pred):
                if label[i] == torch.max(p.data, 0)[1]:
                    total = total + 1
        # calculate accuracy
        accuracy = total / len(mnist_testset)
        # append accuracy here
        test_acc.append(accuracy)
        # append test loss here
        total_test_loss = total_test_loss / (itr + 1)
        test_loss.append(total_test_loss)
        print('\nEpoch: {}/{}, Train Loss: {:.8f}, Test Loss: {:.8f}, Test Accuracy: {:.8f}'.format(epoch + 1, no_epochs, total_train_loss, total_test_loss, accuracy))
        if total_test_loss < best_test_loss:
            best_test_loss = total_test_loss
            print("Saving the model state dictionary for Epoch: {} with Test loss: {:.8f}".format(epoch + 1, total_test_loss))
            torch.save(model.state_dict(), "model.dth")
        # timer finishes (the measured time covers training, testing, and saving)
        end = timer()
        print(end - start, "seconds")
    return no_epochs, test_acc, test_loss
model_sgd = Model()
optimizer_SGD = torch.optim.SGD(model_sgd.parameters(), lr=0.1)
sgd_no_epochs, sgd_test_acc, sgd_test_loss = my_trainer(optimizer=optimizer_SGD, model=model_sgd)
I measured how long each epoch took.
The results are below.
Epoch: 1/10, Train Loss: 0.23193890, Test Loss: 0.12670580, Test Accuracy: 0.96230000
63.98903721500005 seconds
Epoch: 2/10, Train Loss: 0.10275097, Test Loss: 0.10111042, Test Accuracy: 0.96730000
63.97179028100004 seconds
Epoch: 3/10, Train Loss: 0.07269370, Test Loss: 0.09668248, Test Accuracy: 0.97150000
63.969843954 seconds
Epoch: 4/10, Train Loss: 0.05658571, Test Loss: 0.09841745, Test Accuracy: 0.97070000
64.24135530400008 seconds
Epoch: 5/10, Train Loss: 0.04183391, Test Loss: 0.09828428, Test Accuracy: 0.97230000
64.19695308500013 seconds
Epoch: 6/10, Train Loss: 0.03393899, Test Loss: 0.08982467, Test Accuracy: 0.97530000
63.96944059600014 seconds
Epoch: 7/10, Train Loss: 0.02808819, Test Loss: 0.08597597, Test Accuracy: 0.97700000
63.59837343000004 seconds
Epoch: 8/10, Train Loss: 0.01859330, Test Loss: 0.07529452, Test Accuracy: 0.97950000
63.591578820999985 seconds
Epoch: 9/10, Train Loss: 0.01383720, Test Loss: 0.08568452, Test Accuracy: 0.97820000
63.66664020100029 seconds
Epoch: 10/10, Train Loss: 0.00911216, Test Loss: 0.07377760, Test Accuracy: 0.98060000
63.92636473799985 seconds
After this, I changed the batch size to 60000 and ran the same program again.
train_dataloader = torch.utils.data.DataLoader(mnist_trainset,
                                               batch_size=60000,
                                               shuffle=True)
test_dataloader = torch.utils.data.DataLoader(mnist_testset,
                                              batch_size=50,
                                              shuffle=False)
print("\n===== Entering SGD optimizer =====\n")
model_sgd = Model()
optimizer_SGD = torch.optim.SGD(model_sgd.parameters(), lr=0.1)
sgd_no_epochs, sgd_test_acc, sgd_test_loss = my_trainer(optimizer=optimizer_SGD, model=model_sgd)
I got this result for batch size == 60000
Epoch: 1/10, Train Loss: 2.32325006, Test Loss: 2.30074144, Test Accuracy: 0.11740000
6.54154992299982 seconds
Epoch: 2/10, Train Loss: 2.30010080, Test Loss: 2.29524792, Test Accuracy: 0.11790000
6.341824101999919 seconds
Epoch: 3/10, Train Loss: 2.29514933, Test Loss: 2.29183527, Test Accuracy: 0.11410000
6.161918789000083 seconds
Epoch: 4/10, Train Loss: 2.29196787, Test Loss: 2.28874513, Test Accuracy: 0.11450000
6.180891567999879 seconds
Epoch: 5/10, Train Loss: 2.28899717, Test Loss: 2.28571669, Test Accuracy: 0.11570000
6.1449509030003355 seconds
Epoch: 6/10, Train Loss: 2.28604794, Test Loss: 2.28270152, Test Accuracy: 0.11780000
6.311743144000047 seconds
Epoch: 7/10, Train Loss: 2.28307867, Test Loss: 2.27968731, Test Accuracy: 0.12250000
6.060618773999977 seconds
Epoch: 8/10, Train Loss: 2.28014660, Test Loss: 2.27666961, Test Accuracy: 0.12890000
6.171511712999745 seconds
Epoch: 9/10, Train Loss: 2.27718973, Test Loss: 2.27364607, Test Accuracy: 0.13930000
6.164125173999764 seconds
Epoch: 10/10, Train Loss: 2.27423453, Test Loss: 2.27061504, Test Accuracy: 0.15350000
6.077817454000069 seconds
As you can see, each epoch clearly took more time with batch_size == 1, which is the opposite of what I had read.
Maybe I am confusing the training time per epoch with the training time until convergence? My intuition seems to be correct according to this webpage: https://medium.com/deep-learning-experiments/effect-of-batch-size-on-neural-net-training-c5ae8516e57
Can someone please explain what is happening?
Solution
This is a borderline question; you should still be able to extract this understanding from the basic literature ... eventually.
Your insight is exactly correct: you are measuring execution time per epoch, rather than total Time-to-Train (TTT). You have also carried the generic "smaller batches" advice ad absurdum: a batch size of 1 is almost guaranteed to be sub-optimal.
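One way to measure Time-to-Train instead of time per epoch is to time the run until test accuracy first reaches a chosen target. The following is a minimal sketch, not the code from the question: time_to_target, train_one_epoch, evaluate, and the 0.97 target are hypothetical names and values picked for illustration, and the stand-in callables take the place of a real training loop.

```python
from timeit import default_timer as timer


def time_to_target(train_one_epoch, evaluate, target_accuracy, max_epochs=50):
    """Train until evaluate() first reaches target_accuracy.

    Returns (epochs_used, seconds). epochs_used is None when the target
    is not reached within max_epochs.
    """
    start = timer()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()                    # one full pass over the training set
        if evaluate() >= target_accuracy:    # accuracy on the test set
            return epoch, timer() - start
    return None, timer() - start


# Stand-in callables so the sketch runs on its own; in practice they would
# wrap one training epoch and one test pass of the real model.
fake_accuracies = iter([0.50, 0.90, 0.98])
epochs_used, seconds = time_to_target(
    train_one_epoch=lambda: None,
    evaluate=lambda: next(fake_accuracies),
    target_accuracy=0.97,
)
print(epochs_used, seconds)
```

Running this once per batch size, each time with a freshly initialized model, compares batch sizes on total time to the same accuracy, which is the quantity the "smaller batches" advice is about.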
The mechanics are very simple at a macro level.
With a batch size of 60k (the entire training set), you run all 60k images through the model, average the loss over them, and then do a single back-propagation and weight update for that averaged result. This tends to lose the learning you can get from focusing on little-seen features.
With a batch size of 1, you run each image individually through the model, average the one result (a very simple operation :-) ), and do a back-propagation and weight update. This tends to over-emphasize individual effects, especially retaining superstitious effects from each single image. It also gives too much weight to the initial assumptions of the first few images.
The most obvious effect of the tiny batch size is that you're doing 60k back-props per epoch instead of 1, so each epoch takes much longer: roughly 64 seconds versus roughly 6 seconds in your runs. The flip side is that the 60k run makes only one weight update per epoch, which is why it is still at about 15% test accuracy after 10 epochs, while the batch-size-1 run is already above 96% after the first.
Either of these approaches is an extreme case, usually absurd in application.
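The number of weight updates per epoch follows directly from the batch size. As a small illustration (updates_per_epoch is a hypothetical helper, and it assumes the last partial batch is kept, which matches the DataLoader default of drop_last=False):

```python
import math


def updates_per_epoch(num_samples, batch_size):
    """Optimizer steps in one pass over the data, keeping the last partial batch."""
    return math.ceil(num_samples / batch_size)


for batch_size in (1, 64, 256, 60_000):
    print(batch_size, updates_per_epoch(60_000, batch_size))
# 1 60000
# 64 938
# 256 235
# 60000 1
```

Each update has a fixed overhead (a Python loop iteration, a forward pass, a backward pass, an optimizer step), so the per-epoch time grows with the number of updates even though the number of images processed is the same.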
You need to experiment to find the "sweet spot" that gives you the fastest convergence to acceptable (near-optimal) accuracy. There are a few considerations in choosing your experimental design:
- Memory size: you want to be able to ingest the entire batch into memory at once. This allows your model to pipeline reading and processing. If you exceed available memory, you will lose a lot of time to swapping. If you under-use the memory, you leave some potential performance untapped.
- Processors: if you're on a multi-processor chip, you want to keep them all busy. If you care to assign processors through your OS controls, you'll also want to play with how many to assign to model computation, and how many to assign to I/O and system use. For instance, in one project I did, our group found that our 32 cores were best used with 28 allocated to computation, 4 reserved for I/O and other system functions.
- Scaling: some characteristics work best in powers of 2. You may find that a batch size that is 2^n or 3 * 2^n for some n, works best, simply because of block sizes and other system allocations.
The experimental design that has worked best for me over the years is to start with a power of 2 that is roughly the square root of the training set size. For you, the square root of 60,000 is about 245, so there's an obvious starting guess of 256. Thus, you'd run experiments at perhaps 64, 128, 256, 512, and 1024. See which ones give you the fastest convergence.
Then do one step of refinement, using that factor of 3. For instance, if you find that the best performance comes at 128, also try 96 and 192.
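The two-stage search described above can be sketched as follows. The function names and the span of two powers of 2 on either side of the start are assumptions made for illustration; only the resulting numbers come from the text.

```python
import math


def starting_batch_size(num_samples):
    """Power of 2 closest (on a log scale) to sqrt(num_samples)."""
    return 2 ** round(math.log2(math.sqrt(num_samples)))


def coarse_candidates(start, span=2):
    """Powers of 2 from start / 2**span up to start * 2**span.

    Assumes start is a power of 2 and at least 2**span.
    """
    exponent = round(math.log2(start))
    return [2 ** (exponent + k) for k in range(-span, span + 1)]


def refine(best):
    """The 3 * 2**n neighbours on either side of the best power of 2.

    Assumes best is a power of 2 and at least 4.
    """
    return [best * 3 // 4, best * 3 // 2]


start = starting_batch_size(60_000)
print(start)                     # 256
print(coarse_candidates(start))  # [64, 128, 256, 512, 1024]
print(refine(128))               # [96, 192]
```

Each candidate would then be trained from a fresh model and compared on time to reach the same target accuracy, not on time per epoch.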
You will likely see very little difference between your "sweet spot" and the adjacent batch sizes; this is the nature of most complex information systems.
Answered By - Prune