Issue
I am looking at the example from pytorch of a model:
for epoch in range(2):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:  # print every 2000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0

print('Finished Training')
And I have a very basic question: the optimizer was never passed to or attached to the model (the way model.compile does it in Keras), nor did it receive the loss or the labels of the last batch or epoch.
How does it "know" how to perform an optimization step?
Solution
Rather than looking for a direct link between the loss and the optimizer, treat the backward pass and the optimizer step as two separate events. There are two distinct mechanisms that act on the parameters and their cached gradients.
The autograd mechanism (the process in charge of gradient computation) lets you call backward on a torch.Tensor (your loss), which in turn backpropagates through all the tensors that were used to compute this final value. In doing so, it traverses what is called the computation graph and updates each parameter's gradient by writing to its grad attribute. At the end of a backward call, every learned parameter of the network that was used to compute the output therefore has a grad attribute containing the gradient of the loss with respect to that parameter. In the training loop, this is the loss.backward() call.
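To make the effect of backward concrete, here is a minimal sketch. It assumes a hypothetical one-parameter "model" (the names w, x and target are illustrative and not part of the original example) and only shows that the grad attribute is empty before the backward call and filled afterwards.

```python
import torch

# Hypothetical one-parameter "model": prediction = w * x
w = torch.tensor(3.0, requires_grad=True)
x = torch.tensor(2.0)
target = torch.tensor(10.0)

loss = (w * x - target) ** 2   # scalar loss that depends on w
grad_before = w.grad           # no backward pass has run yet

loss.backward()                # autograd walks the graph from loss back to w

# Analytically: d(loss)/dw = 2 * (w*x - target) * x = 2 * (6 - 10) * 2 = -16
grad_after = w.grad.item()
print(grad_before, grad_after)
```

Note that nothing here mentions an optimizer: backward only writes gradients onto the tensors themselves.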
The optimizer is independent of the backward pass: it does not need to know how the gradients were produced. You can call backward on your graph once, multiple times, or on different loss terms, depending on your use case. The optimizer's task is to take the model's parameters independently (that is, irrespective of the network architecture or its computation graph) and update them using a given optimization routine (for example Stochastic Gradient Descent, Root Mean Squared Propagation, etc.). It goes through all the parameters it was initialized with and updates each one using its gradient value, which is expected to have been stored in the grad attribute by at least one backward pass. In the training loop, this is the optimizer.step() call.
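The phrase "parameters it was initialized with" is the key to the original question: an optimizer is typically constructed from the model's parameters, for instance with optim.SGD(net.parameters(), ...). The sketch below assumes a small stand-in model (nn.Linear) and plain SGD without momentum; these choices are illustrative, not the setup from the question. It checks that the optimizer holds the very same tensor objects as the model, and that step applies the update p = p - lr * p.grad.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
net = nn.Linear(2, 1)  # stand-in model, assumed for illustration

# The link is made here: the optimizer receives references to the parameters.
optimizer = optim.SGD(net.parameters(), lr=0.1)

# The optimizer stores the same tensor objects the model uses, not copies.
shared = all(
    p_opt is p_net
    for p_opt, p_net in zip(optimizer.param_groups[0]["params"], net.parameters())
)

inputs = torch.tensor([[1.0, 2.0]])
labels = torch.tensor([[0.5]])
loss = nn.functional.mse_loss(net(inputs), labels)
loss.backward()  # fills net.weight.grad and net.bias.grad

# What plain SGD is expected to do with the stored gradient.
weight_before = net.weight.detach().clone()
expected = weight_before - 0.1 * net.weight.grad

optimizer.step()  # reads .grad from each registered parameter and updates it
matches_manual_update = torch.allclose(net.weight.detach(), expected)
print(shared, matches_manual_update)
```

Because the parameters are shared by reference, the optimizer never needs to see the loss, the labels or the model itself.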
Important notes:
Keep in mind, though, that the backward pass and the update performed by the optimizer are linked implicitly, only by the fact that the optimizer uses the gradients computed by the preceding backward call.
In PyTorch, parameter gradients are kept in memory and accumulate across backward calls, so you have to clear them before performing a new backward call. This is done with the optimizer's zero_grad function. In practice, it clears the grad attribute of the tensors it has registered as parameters.
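The accumulation behaviour can be seen with a small sketch, again assuming a hypothetical single parameter w. Depending on the PyTorch version, zero_grad either sets the gradient to None or fills it with zeros, so the check below accepts both.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)   # hypothetical single parameter
optimizer = torch.optim.SGD([w], lr=0.1)

(2 * w).backward()
first = w.grad.item()          # gradient of 2*w with respect to w

(2 * w).backward()             # second backward without clearing
accumulated = w.grad.item()    # the new gradient was added to the old one

optimizer.zero_grad()          # clear gradients of all registered parameters
cleared = w.grad is None or w.grad.item() == 0.0
print(first, accumulated, cleared)
```

This is why the training loop calls optimizer.zero_grad() at the start of every iteration.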
Answered By - Ivan