Issue
There are 3 GPUs in my system, and I want to run on the last one, i.e. GPU 2. For this reason, I set gpu_id to 2 in my configuration file and also set CUDA_VISIBLE_DEVICES=2. But in my program, the following lines always select GPU 0:
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
How can I fix this?
Solution
When you set CUDA_VISIBLE_DEVICES=2, you tell CUDA to expose only the third GPU to your process. That is, as far as PyTorch is concerned, there is only one GPU, so torch.cuda.device_count() returns 1 (and not 3). In a single-process job, torch.distributed.get_world_size() is likewise 1 and torch.distributed.get_rank() returns 0.
Inside your process, the index of this GPU is 0, since no other GPUs are available to the process. Physically, though, all processing is done on the third GPU, the one that was allocated to the job. Seeing device 0 selected is therefore the expected behavior.
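The renumbering described above can be sketched in plain Python. This is only a simplified model, not how CUDA or PyTorch implement it: the helper visible_gpu_ids is a hypothetical name, and it assumes the variable holds a well-formed, comma-separated list of integer GPU ids.

```python
def visible_gpu_ids(visible_devices, total_gpus):
    """Return the physical GPU ids a process would see, in logical order.

    Hypothetical helper for illustration only. Assumes `visible_devices`
    is either None (variable unset) or a well-formed comma-separated
    list of integer ids, e.g. "2" or "2,0".
    """
    if visible_devices is None:
        # Variable unset: every GPU is visible under its own id.
        return list(range(total_gpus))
    return [int(token) for token in visible_devices.split(",") if token.strip()]


ids = visible_gpu_ids("2", total_gpus=3)
print(len(ids))  # 1 -> the process sees a single GPU
print(ids[0])    # 2 -> logical device 0 is physical GPU 2
```

The position in the returned list is the index the process uses (the one passed to torch.cuda.set_device), and the value at that position is the physical GPU behind it. With CUDA_VISIBLE_DEVICES=2, index 0 is the only valid index, and it refers to the third GPU.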
Answered By - Shai