Issue
I understand that autograd is used to imply automatic differentiation. But what exactly is tape-based autograd in PyTorch, and why are there so many discussions that either affirm or deny it?
For example, this:
"In pytorch, there is no traditional sense of tape"
and this:
"We don’t really build gradient tapes per se. But graphs."
but not this:
"Autograd is now a core torch package for automatic differentiation. It uses a tape based system for automatic differentiation."
And for further reference, please compare it with GradientTape in TensorFlow.
Solution
There are different types of automatic differentiation, e.g. forward-mode, reverse-mode, and hybrids (more explanation). The tape-based autograd in PyTorch simply refers to the use of reverse-mode automatic differentiation (source). Reverse-mode auto diff is simply a technique for computing gradients efficiently, and it happens to be the technique used by backpropagation (source).
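To make the distinction concrete, here is a minimal sketch using torch.autograd.functional (the function f and the input x below are made up purely for illustration): forward mode pushes a tangent vector through the computation and yields a Jacobian-vector product, while reverse mode pulls a cotangent vector backwards and yields a vector-Jacobian product, which is what gradient computation needs when there are many inputs and few outputs.
# Minimal sketch: Jacobian-vector product (forward-style) vs.
# vector-Jacobian product (reverse-style) in PyTorch.
import torch
from torch.autograd.functional import jvp, vjp

def f(x):
    # toy vector-valued function R^3 -> R^2 (made up for illustration)
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0])

# Jacobian-vector product: push the tangent vector e_0 through f
_, jvp_out = jvp(f, (x,), (torch.tensor([1.0, 0.0, 0.0]),))
print(jvp_out)   # first column of the Jacobian: tensor([2., 0.])

# vector-Jacobian product: pull the cotangent vector e_0 back through f
_, vjp_out = vjp(f, (x,), torch.tensor([1.0, 0.0]))
print(vjp_out)   # first row of the Jacobian (as a 1-tuple): (tensor([2., 1., 0.]),)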
Now, in PyTorch, Autograd is the core torch package for automatic differentiation. It uses a tape-based system for automatic differentiation: in the forward phase, the autograd tape records all the operations it executes, and in the backward phase, it replays those operations in reverse.
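As a minimal sketch of that record-and-replay behaviour (the tensor names here are arbitrary):
# Minimal sketch of PyTorch's record-and-replay autograd.
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x   # forward phase: operations are recorded on the graph ("tape")
y.backward()         # backward phase: recorded operations are replayed in reverse
print(x.grad)        # dy/dx = 2*x + 2 = tensor(8.)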
The same goes for TensorFlow: to differentiate automatically, it also needs to remember what operations happen in what order during the forward pass; then, during the backward pass, it traverses this list of operations in reverse order to compute gradients. TensorFlow provides the tf.GradientTape API for automatic differentiation, that is, for computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow records relevant operations executed inside the context of a tf.GradientTape onto a tape, and then uses that tape to compute the gradients of the recorded computation using reverse-mode differentiation.
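For instance, a minimal sketch of the equivalent computation with tf.GradientTape (again, the variable names are arbitrary):
# Minimal sketch of the equivalent computation with tf.GradientTape.
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2 + 2 * x        # operations inside the scope are recorded onto the tape
dy_dx = tape.gradient(y, x)   # the tape is traversed in reverse to compute the gradient
print(dy_dx)                  # tf.Tensor(8.0, shape=(), dtype=float32)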
So, from a high-level viewpoint, both are doing the same thing. However, in a custom training loop, the forward pass and the loss calculation are more explicit in TensorFlow, since they happen inside the tf.GradientTape scope, whereas in PyTorch the recording is implicit. On the other hand, PyTorch requires gradient tracking to be disabled temporarily while the training parameters (weights and biases) are updated in place, and for that it uses the torch.no_grad context explicitly. In other words, TensorFlow's tape.gradient(...) call plays the same role as PyTorch's loss.backward(). Below is a simplistic form of the above statements in code.
# TensorFlow
[w, b] = tf_model.trainable_variables
for epoch in range(epochs):
    with tf.GradientTape() as tape:
        # forward pass and loss calculation
        # within the explicit tape scope
        predictions = tf_model(x)
        loss = squared_error(predictions, y)
    # compute gradients (grad)
    w_grad, b_grad = tape.gradient(loss, tf_model.trainable_variables)
    # update training variables
    w.assign(w - w_grad * learning_rate)
    b.assign(b - b_grad * learning_rate)
# PyTorch
[w, b] = torch_model.parameters()
for epoch in range(epochs):
    # forward pass and loss calculation
    # implicit tape-based AD
    y_pred = torch_model(inputs)
    loss = squared_error(y_pred, labels)
    # compute gradients (grad)
    loss.backward()
    # update training variables / parameters
    with torch.no_grad():
        w -= w.grad * learning_rate
        b -= b.grad * learning_rate
        w.grad.zero_()
        b.grad.zero_()
FYI, in the above, the trainable variables (w, b) are updated manually in both frameworks, but we would generally use an optimizer (e.g. Adam) to do the job.
# TensorFlow
# ....
# update training variables
optimizer.apply_gradients(zip([w_grad, b_grad], tf_model.trainable_variables))
# PyTorch
# ....
# update training variables / parameters
optimizer.step()
optimizer.zero_grad()
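For completeness, here is a minimal sketch of how the PyTorch loop above might look with a built-in optimizer (torch.optim.Adam is just one choice for illustration; torch_model, squared_error, inputs, labels, learning_rate, and epochs are the same assumed names as in the earlier snippets):
# PyTorch: the same loop, letting torch.optim handle the parameter updates.
import torch

optimizer = torch.optim.Adam(torch_model.parameters(), lr=learning_rate)

for epoch in range(epochs):
    y_pred = torch_model(inputs)
    loss = squared_error(y_pred, labels)
    optimizer.zero_grad()   # clear gradients accumulated from the previous step
    loss.backward()         # compute gradients via the autograd graph
    optimizer.step()        # update the parameters using the computed gradients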
Answered By - M.Innat