Issue
I'm a beginner with PyTorch and am trying to train an MNIST model based on a custom neural network class. My optimizer, loss function and learning rate scheduler are:
optimizer = optim.Adam(model.parameters(), lr=0.003)
loss_fn = nn.CrossEntropyLoss()
exp_lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
Initially, my training loop looked like this:
# this training gives a high loss that barely changes
def training(epoch):
    model.train()
    for batch_idx, (imgs, labels) in enumerate(train_loader):
        imgs = imgs.to(device=device)
        labels = labels.to(device=device)
        optimizer.zero_grad()
        outputs = model(imgs)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        exp_lr_scheduler.step()  # inside the loop and after the optimizer
        if (batch_idx + 1) % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, (batch_idx + 1) * len(imgs), len(train_loader.dataset),
                100. * (batch_idx + 1) / len(train_loader), loss.item()))
But this training was not effective, and my loss was almost the same in every epoch.
Then I changed my training function to this:
# this training works perfectly
def training(epoch):
    model.train()
    exp_lr_scheduler.step()  # out of the loop but before optimizer step
    for batch_idx, (imgs, labels) in enumerate(train_loader):
        imgs = imgs.to(device=device)
        labels = labels.to(device=device)
        optimizer.zero_grad()
        outputs = model(imgs)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        if (batch_idx + 1) % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, (batch_idx + 1) * len(imgs), len(train_loader.dataset),
                100. * (batch_idx + 1) / len(train_loader), loss.item()))
And now it's working correctly. I just don't understand why. I have two questions:
- Shouldn't exp_lr_scheduler.step() be in the for loop so that it also gets updated with every epoch?
- The latest PyTorch version says to keep exp_lr_scheduler.step() after optimizer.step(), but doing this in my training function gives me a worse loss.
What can be the reason, or am I doing it wrong?
Solution
StepLR multiplies the learning rate by gamma after every step_size calls to exp_lr_scheduler.step(). If the scheduler is stepped once per epoch, as in your second snippet, step_size=7 means the learning rate becomes 10 times smaller every 7 epochs. In your first snippet the scheduler is stepped once per batch, so the learning rate becomes 10 times smaller every 7 batches instead.
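To make the StepLR behaviour concrete, here is a minimal sketch that records the learning rate in effect at each optimizer step. The tiny nn.Linear model, the zero input and the helper name lr_trace are placeholders chosen for illustration and are not part of the code in the question; only the optimizer and scheduler settings are taken from it.

```python
import torch
from torch import nn, optim


def lr_trace(num_steps, base_lr=0.003, step_size=7, gamma=0.1):
    """Return the learning rate in effect at each of num_steps optimizer steps.

    lr_trace is a hypothetical helper; the model is a tiny stand-in.
    """
    model = nn.Linear(4, 2)
    optimizer = optim.Adam(model.parameters(), lr=base_lr)
    scheduler = optim.lr_scheduler.StepLR(
        optimizer, step_size=step_size, gamma=gamma)

    lrs = []
    for _ in range(num_steps):
        # Learning rate that the upcoming optimizer.step() will use.
        lrs.append(optimizer.param_groups[0]["lr"])

        optimizer.zero_grad()
        loss = model(torch.zeros(1, 4)).sum()  # dummy forward pass
        loss.backward()
        optimizer.step()
        scheduler.step()  # one scheduler step per optimizer step
    return lrs


lrs = lr_trace(21)
print(lrs[0], lrs[7], lrs[14])
```

The scheduler only counts its own step() calls. Whether those 7 calls correspond to 7 epochs or 7 batches depends entirely on where the call sits in the training loop.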
Have you tried increasing the starting learning rate? I would try 0.1 or 0.01. I think the problem could be the size of the starting learning rate, since it is already quite small. As a result, the gradient descent algorithm (or a variant of it, such as Adam) cannot move towards the minimum because the step is too small, and your results stay the same (at the same point of the function being minimized).
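As a rough illustration of how small the step can get, the sketch below uses the closed form of StepLR, lr = base_lr * gamma ** (steps // step_size), to compare stepping the scheduler once per batch (as in the first training loop in the question) with stepping it once per epoch. The batch size of 64 is an assumption, since the question does not state it, and lr_after_steps is a hypothetical helper name.

```python
import math


def lr_after_steps(num_scheduler_steps, base_lr=0.003, step_size=7, gamma=0.1):
    """Closed-form StepLR value after a number of scheduler.step() calls."""
    decays = num_scheduler_steps // step_size
    return base_lr * gamma ** decays


train_images = 60_000  # size of the MNIST training set
batch_size = 64        # assumption: not given in the question
batches_per_epoch = math.ceil(train_images / batch_size)

# Scheduler stepped once per batch: one epoch means batches_per_epoch steps.
per_batch = lr_after_steps(batches_per_epoch)

# Scheduler stepped once per epoch: one epoch means a single step.
per_epoch = lr_after_steps(1)

print(batches_per_epoch, per_batch, per_epoch)
```

Under these assumptions, per-batch stepping drives the learning rate to a vanishingly small value well before the first epoch ends, which would be consistent with a loss that stays almost the same in every epoch.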
Hope it helps.
Answered By - David Serrano