I was running some masked language modeling training with old code when I hit a strange error that took a long time to debug:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Eventually I located the root of the bug: the deprecated AdamW optimizer from HuggingFace transformers. For completeness, this is the optimizer call and the specific learning rate schedule I was using with PyTorch Lightning:
from torch.optim.lr_scheduler import LambdaLR
from transformers import (
    AdamW,
    get_linear_schedule_with_warmup,
)


def configure_optimizers(self):
    "Prepare optimizer and schedule (linear warmup and decay)"
    model = self.student
    # Bias and LayerNorm weights are excluded from weight decay.
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)
            ],
            "weight_decay": self.hparams.weight_decay,
        },
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.0,
        },
    ]
    optimizer = AdamW(
        optimizer_grouped_parameters,
        lr=self.hparams.learning_rate,
        eps=self.hparams.adam_epsilon,
    )
    # Linear warmup and decay, stepped every training batch.
    scheduler = {
        "scheduler": LambdaLR(
            optimizer,
            lr_lambda=LRPolicy(
                self.hparams.warmup_steps,
                self.trainer.estimated_stepping_batches,
            ),
        ),
        "interval": "step",
        "frequency": 1,
        "name": "learning_rate",
    }
    return [optimizer], [scheduler]
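LRPolicy isn't defined in the snippet above. As a minimal sketch, assuming it reproduces the linear warmup-and-decay factor that get_linear_schedule_with_warmup builds internally (using a class instead of a lambda keeps the scheduler picklable for checkpointing), it might look like this:

class LRPolicy:
    "Picklable lr_lambda: linear warmup to the base LR, then linear decay to 0."

    def __init__(self, warmup_steps, total_steps):
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, current_step):
        if current_step < self.warmup_steps:
            # Warmup: scale the LR linearly from 0 up to its base value.
            return float(current_step) / float(max(1, self.warmup_steps))
        # Decay: scale the LR linearly from its base value down to 0.
        return max(
            0.0,
            float(self.total_steps - current_step)
            / float(max(1, self.total_steps - self.warmup_steps)),
        )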
The only thing that needs to change is the import of AdamW:
 from transformers import (
-    AdamW,
     get_linear_schedule_with_warmup,
 )
+from torch.optim import AdamW
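Since the learning rate, epsilon, and weight decay are all passed explicitly above, torch.optim.AdamW is effectively a drop-in replacement here. One caveat if you rely on defaults: the two implementations differ (torch's AdamW defaults to weight_decay=0.01 and eps=1e-8, while the deprecated transformers version defaulted to weight_decay=0.0 and eps=1e-6). The import block after the change looks like this:

from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from transformers import get_linear_schedule_with_warmup

# configure_optimizers stays exactly the same; AdamW now just comes from torch.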
Since the error message for this bug led me down completely the wrong path, I decided that a (hopefully) findable blog post could be helpful.