Issue
Does integration.TorchDistributedTrial support multi-node optimization?
I'm using Optuna on a SLURM cluster. Suppose I would like to do a distributed hyperparameter optimization using two nodes with two GPUs each. Would submitting a script like pytorch_distributed_simple.py to multiple nodes yield the expected results?
I assume every node would be responsible for executing its own trials (i.e. no nodes share trials) and that every GPU on a node handles its own portion of the data, as determined by the torch.utils.data.DataLoader's sampler. Is this assumption correct, or are further edits needed apart from TorchDistributedTrial's requirement to pass None to objective calls on ranks other than 0?
I already tried the above, but I'm not sure how to verify that every node is responsible for distinct trials.
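For concreteness, the per-GPU data sharding I have in mind looks roughly like this (the dataset and batch size are placeholders, and torch.distributed is assumed to be initialized already):

import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Placeholder dataset; in practice this is the real training set.
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# DistributedSampler splits the dataset across the processes in the current
# process group, so each GPU only iterates over its own shard of the data.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)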
Solution
It turns out Optuna does allow multiple Optuna processes to run distributed optimizations side by side. Why wouldn't it :)
Basically, run pytorch_distributed_simple.py on multiple nodes (I use SLURM for this) and make sure every subprocess calls the trial.report() method. Every node is now responsible for its own trial. Trials can use DDP.
My method differs from the provided code in that I use SLURM (different environment variables) and SQLite to store the study information. Moreover, I use the NCCL backend to initialize the process groups, and therefore need to pass a device to TorchDistributedTrial.
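Roughly, the per-node script ends up looking like the sketch below (the study name, storage path, trial count, and the SLURM-to-torch environment-variable mapping are illustrative; the real training loop is the one in pytorch_distributed_simple.py):

import os

import optuna
import torch
import torch.distributed as dist
from optuna.integration import TorchDistributedTrial


def objective(single_trial):
    # Non-rank-0 processes pass single_trial=None; TorchDistributedTrial
    # broadcasts the suggested parameters from rank 0 to the other ranks.
    # With the NCCL backend a CUDA device must be supplied so the broadcast
    # tensors live on the GPU.
    local_rank = int(os.environ.get("LOCAL_RANK", os.environ.get("SLURM_LOCALID", 0)))
    device = torch.device("cuda", local_rank)
    trial = TorchDistributedTrial(single_trial, device=device)

    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)

    accuracy = 0.0
    for epoch in range(10):
        # ... build the model, wrap it in DistributedDataParallel, train and
        # evaluate here; accuracy is a placeholder for the real metric ...

        # Every subprocess must call report()/should_prune(); they rely on
        # collective communication and the group deadlocks otherwise.
        trial.report(accuracy, epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return accuracy


if __name__ == "__main__":
    # One process group per node (spanning its local GPUs); MASTER_ADDR,
    # MASTER_PORT, RANK and WORLD_SIZE are assumed to be derived from the
    # SLURM variables before this point.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()

    n_trials = 20
    if rank == 0:
        # The SQLite file sits on a filesystem shared by all nodes, so the
        # nodes coordinate their trials through the study storage.
        study = optuna.create_study(
            study_name="distributed-example",
            storage="sqlite:///optuna_study.db",
            load_if_exists=True,
        )
        study.optimize(objective, n_trials=n_trials)
    else:
        for _ in range(n_trials):
            try:
                objective(None)
            except optuna.TrialPruned:
                pass

    dist.destroy_process_group()

Because TorchDistributedTrial relies on collective communication, every rank has to make the same sequence of suggest/report calls; the None-passing loop on the non-zero ranks is what keeps them in lockstep with rank 0.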
Unrelated, but I also wanted to call MaxTrialsCallback() in every subprocess. To achieve this, I pass the callback to the rank-0 study.optimize() call and invoke it explicitly in the local non-rank-0 processes after each objective call.
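A sketch of that, reusing the objective from the snippet above (MAX_TRIALS, the study name and the storage path are illustrative; instead of invoking the callback object itself on the non-zero ranks, this sketch checks the completed-trial count in the shared storage, which serves the same purpose):

import optuna
import torch.distributed as dist
from optuna.study import MaxTrialsCallback
from optuna.trial import TrialState

STORAGE = "sqlite:///optuna_study.db"   # same shared storage as above
STUDY_NAME = "distributed-example"
MAX_TRIALS = 100                        # global trial budget across all nodes
N_TRIALS_PER_NODE = 20

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()

    if rank == 0:
        study = optuna.create_study(
            study_name=STUDY_NAME, storage=STORAGE, load_if_exists=True
        )
        # Rank 0 receives the callback the normal way, via study.optimize().
        study.optimize(
            objective,
            n_trials=N_TRIALS_PER_NODE,
            callbacks=[MaxTrialsCallback(MAX_TRIALS, states=(TrialState.COMPLETE,))],
        )
    else:
        for _ in range(N_TRIALS_PER_NODE):
            try:
                objective(None)
            except optuna.TrialPruned:
                pass
            # Non-rank-0 processes have no study object of their own, so load
            # the study from the shared storage after every objective call and
            # stop once the global budget has been reached.
            study = optuna.load_study(study_name=STUDY_NAME, storage=STORAGE)
            n_done = len(study.get_trials(deepcopy=False, states=(TrialState.COMPLETE,)))
            if n_done >= MAX_TRIALS:
                break

    dist.destroy_process_group()

Keep in mind that the number of objective calls still has to stay in sync across the ranks of a node, since TorchDistributedTrial uses collective communication internally.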
Answered By - Siem