Issue
https://github.com/huggingface/transformers/blob/master/examples/run_glue.py
I want to adapt this script to do text classification on my own data. The machine for this task is a single computer with two graphics cards, so this involves a kind of "distributed" training, governed by the local_rank variable in the script above, especially where local_rank equals 0 or -1, as in line 83.
After reading some material on distributed computation, my guess is that local_rank is like an ID for a machine, and that 0 means this machine is the "main" or "head" of the computation. But what is -1?
Solution
Q: But what is -1?
Usually, this is used to disable the distributed setting. Indeed, as you can see here:
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
and here:
if args.local_rank != -1:
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
                                                      output_device=args.local_rank,
                                                      find_unused_parameters=True)
setting local_rank to -1 has this effect.
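To make the two modes concrete, here is a minimal, self-contained sketch of the same pattern (it is not the run_glue.py code itself; the tiny TensorDataset and Linear model are placeholders for your real data and model). torch.distributed.launch spawns one process per GPU and passes --local_rank to each of them; when you run the script directly, the argparse default of -1 is used and everything stays single-process.

import argparse

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler

parser = argparse.ArgumentParser()
# -1 is the default: it means "no distributed launcher was used".
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

if args.local_rank == -1:
    # Single-process mode: train on one device (or wrap in DataParallel).
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
else:
    # Distributed mode: one process per GPU. local_rank selects this
    # process's GPU; rank 0 is conventionally the "main" process.
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    dist.init_process_group(backend="nccl")

# Placeholder dataset standing in for the GLUE features.
dataset = TensorDataset(torch.randn(32, 4), torch.randint(0, 2, (32,)))

# Same pattern as the line quoted above: plain random sampling when not
# distributed, a DistributedSampler that shards the data otherwise.
sampler = RandomSampler(dataset) if args.local_rank == -1 else DistributedSampler(dataset)
loader = DataLoader(dataset, sampler=sampler, batch_size=8)

model = torch.nn.Linear(4, 2).to(device)
if args.local_rank != -1:
    model = torch.nn.parallel.DistributedDataParallel(model,
                                                      device_ids=[args.local_rank],
                                                      output_device=args.local_rank)

On your single machine with two graphics cards, you would launch this with python -m torch.distributed.launch --nproc_per_node=2 your_script.py, which starts two processes and passes --local_rank=0 and --local_rank=1 to them; running python your_script.py directly leaves local_rank at -1 and disables the distributed path.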
Answered By - Berriel