Issue
When I try the code below, I get the error ValueError: some parameters appear in more than one parameter group. However, inspecting the model, it is not clear to me which module is the overlapping one. The only possibility I can see is that lm_head and transformer.wte both have a parameter named weight, and I'm wondering whether that shared name is what is causing the error.
I am doing this so that the lower layers "move slowly" compared to the upper layers. I'm happy to hear if there is an alternative way to set up these discriminative learning rates that avoids overlapping parameters (if there are any).
import torch
from transformers import AutoModelForCausalLM

language_model = AutoModelForCausalLM.from_pretrained("gpt2")

FREEZE_LAYERS = 2
caption_params = [
    {"params": language_model.lm_head.parameters(), "lr": 1e-4},
    {"params": language_model.transformer.ln_f.parameters(), "lr": 1e-4},
    {"params": language_model.transformer.h[FREEZE_LAYERS:].parameters(), "lr": 5e-5},
    {"params": language_model.transformer.wte.parameters(), "lr": 1e-5},
]
optimizer = torch.optim.Adam(caption_params)
Solution
The error message is diagnosing the problem correctly: there are some parameters that appear in more than one parameter group. You can prove this to yourself by doing the following:
>>> parameter_ids = [[id(p) for p in group["params"]] for group in caption_params]
>>> parameter_ids[0]
[140666221372896]
>>> parameter_ids[3]
[140666221372896]
This reveals that the first and last parameter groups, each of which contains a single large embedding tensor, hold references to the exact same tensor. What is this tensor? Let's look at it through both routes of reference to confirm it really is the same object:
>>> a = next(language_model.lm_head.parameters())
>>> a
Parameter containing:
tensor([[-0.1101, -0.0393,  0.0331,  ..., -0.1364,  0.0151,  0.0453],
        [ 0.0403, -0.0486,  0.0462,  ...,  0.0861,  0.0025,  0.0432],
        [-0.1275,  0.0479,  0.1841,  ...,  0.0899, -0.1297, -0.0879],
        ...,
        [-0.0445, -0.0548,  0.0123,  ...,  0.1044,  0.0978, -0.0695],
        [ 0.1860,  0.0167,  0.0461,  ..., -0.0963,  0.0785, -0.0225],
        [ 0.0514, -0.0277,  0.0499,  ...,  0.0070,  0.1552,  0.1207]],
       requires_grad=True)
>>> b = next(language_model.transformer.wte.parameters())
>>> b
Parameter containing:
tensor([[-0.1101, -0.0393,  0.0331,  ..., -0.1364,  0.0151,  0.0453],
        [ 0.0403, -0.0486,  0.0462,  ...,  0.0861,  0.0025,  0.0432],
        [-0.1275,  0.0479,  0.1841,  ...,  0.0899, -0.1297, -0.0879],
        ...,
        [-0.0445, -0.0548,  0.0123,  ...,  0.1044,  0.0978, -0.0695],
        [ 0.1860,  0.0167,  0.0461,  ..., -0.0963,  0.0785, -0.0225],
        [ 0.0514, -0.0277,  0.0499,  ...,  0.0070,  0.1552,  0.1207]],
       requires_grad=True)
>>> a is b
True
This makes sense, because many Transformer-based models tie the weights used in mapping between word IDs and word representations at the beginning (the initial Embedding layer) and end (the LM head) of the model.
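For GPT-2 in particular, you can check this on the Hugging Face config: recent transformers versions expose a tie_word_embeddings flag, which defaults to True (if your version differs, the a is b check above remains the definitive test):
>>> language_model.config.tie_word_embeddings
True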
For your specific problem, you can either accept that the tied weights will move at the same learning rate, or you can untie them by cloning the parameter and assigning the new copy to one of the two modules. A sketch of both options follows.
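Here is a minimal sketch of both options, assuming the standard GPT2LMHeadModel layout in which lm_head.weight and transformer.wte.weight are the tied tensors (the learning rates are just the ones from your question):
import torch
from transformers import AutoModelForCausalLM

language_model = AutoModelForCausalLM.from_pretrained("gpt2")
FREEZE_LAYERS = 2

# Option 1: keep the weights tied and list the shared tensor only once,
# so the tied embedding/LM-head matrix trains at a single learning rate.
tied_params = [
    {"params": language_model.transformer.ln_f.parameters(), "lr": 1e-4},
    {"params": language_model.transformer.h[FREEZE_LAYERS:].parameters(), "lr": 5e-5},
    {"params": language_model.transformer.wte.parameters(), "lr": 1e-5},
]
optimizer = torch.optim.Adam(tied_params)

# Option 2: untie the weights by giving lm_head its own copy of the tensor.
# After this, the original four parameter groups no longer overlap.
language_model.lm_head.weight = torch.nn.Parameter(
    language_model.lm_head.weight.detach().clone()
)
untied_params = [
    {"params": language_model.lm_head.parameters(), "lr": 1e-4},
    {"params": language_model.transformer.ln_f.parameters(), "lr": 1e-4},
    {"params": language_model.transformer.h[FREEZE_LAYERS:].parameters(), "lr": 5e-5},
    {"params": language_model.transformer.wte.parameters(), "lr": 1e-5},
]
optimizer = torch.optim.Adam(untied_params)
If you untie, you may also want to set language_model.config.tie_word_embeddings = False so that the copy does not get tied again later (for example if the model is resized, saved, or reloaded); the exact behaviour depends on your transformers version.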
Answered By - Luke G