Issue
So let's say I have an input X and a sequential network of net A, net B, and net C. If I detach net B and put X through A -> B -> C, do I lose gradient information from A because B is detached? I would assume not? I'm assuming it would just treat B like a constant added to the output of A rather than something differentiable.
Solution
TL;DR: Preventing gradient computation on B won't stop gradients from being computed for the upstream network A.
I think there is some confusion about what you consider "detaching a model". In my opinion, there are three distinct behaviours to keep in mind here:
1. You can detach a tensor, which effectively removes it from the computational graph: if this tensor is then used to compute another tensor requiring gradient, the backpropagation step will not propagate past this "detached" tensor (a minimal sketch of this is shown at the end of this answer).

2. In your way of describing "detaching a model", you can disable gradient computation on given layers of your network by switching requires_grad to False on their parameters. This can be done in a single line at the module level with nn.Module.requires_grad_. So in your case, B.requires_grad_(False) will freeze the parameters of B such that they can't be updated. In other words, the gradients of the parameters of B won't be computed; however, the intermediate gradients used to propagate to A will! Here is a minimal example:

>>> import torch
>>> import torch.nn as nn

>>> A = nn.Linear(10, 10)
>>> B = nn.Linear(10, 10)
>>> C = nn.Linear(10, 10)

# disable gradient computation on B
>>> B.requires_grad_(False)

# dummy input, inference, and backpropagation
>>> x = torch.rand(1, 10, requires_grad=True)
>>> C(B(A(x))).mean().backward()
We can now check that the gradients of C and A have indeed been filled properly:
>>> A.weight.grad.sum()
tensor(0.3281)
>>> C.weight.grad.sum()
tensor(-1.6335)
However, of course, B.weight.grad returns None.
3. Lastly, yet another behaviour appears when using the no_grad context manager: it disables gradient tracking entirely for everything computed inside it. If you do something like:

>>> yA = A(x)
>>> with torch.no_grad():
...     yB = B(yA)
>>> yC = C(yB)
Here the graph is cut at yB: yC is detached from everything upstream of the no_grad block. Calling backward on yC will still fill the gradients of C's parameters, but nothing will propagate back through B to A or x.
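As a quick check, we can continue the session above. Note that the gradients of A, C, and x were already populated by the earlier backward call, so we clear them first:

# clear the gradients left over from the earlier example
>>> A.weight.grad = None
>>> C.weight.grad = None
>>> x.grad = None

>>> yC.mean().backward()

# C's parameters still receive gradients...
>>> C.weight.grad is None
False

# ...but nothing propagates back past the no_grad block
>>> A.weight.grad is None
True
>>> x.grad is None
True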
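And circling back to the first behaviour, here is what detaching a tensor looks like in practice. This is a minimal sketch using fresh, hypothetical layers A2 and C2 (so that no gradients linger from the examples above): cutting the graph with detach means nothing upstream of the cut receives gradients.

# fresh layers, so no gradients linger from the examples above
>>> A2, C2 = nn.Linear(10, 10), nn.Linear(10, 10)
>>> x2 = torch.rand(1, 10, requires_grad=True)

# detach the output of A2: the graph is cut at this tensor
>>> y = A2(x2).detach()
>>> C2(y).mean().backward()

# C2 receives gradients, but A2 and x2 do not
>>> C2.weight.grad is None
False
>>> A2.weight.grad is None
True
>>> x2.grad is None
True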
Answered By - Ivan