Issue
This is a cross-post of my question on the PyTorch forum.
When using DistributedDataParallel (DDP) from PyTorch on only one node, I expect the results to match those of a script without DistributedDataParallel.
I created a simple MNIST training setup with a three-layer feed-forward neural network. It gives significantly lower accuracy (around 10%) when trained with the same hyperparameters, the same number of epochs, and generally the same code, except for the use of the DDP library.
I created a GitHub repository demonstrating my problem.
I hope it is a usage error of the library, but I do not see where the problem lies; colleagues of mine have already audited the code. I also tried it on macOS with a CPU and on three different GPU/Ubuntu combinations (one with a 1080 Ti, one with a 2080 Ti, and a cluster with P100s), all giving the same results. Seeds are fixed for reproducibility.
Solution
You are using different batch sizes in your two experiments: batch_size=128 for mnist-distributed.py and batch_size=32 for mnist-plain.py. With different effective batch sizes, the two trainings will not produce the same performance.
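Note also that under DDP with a DistributedSampler, each process draws its own batches, so the effective global batch size is the per-process batch size multiplied by the number of processes. A minimal sketch of picking a per-process batch size that reproduces a plain single-process run (the helper name is hypothetical, not from the repository):

```python
def per_process_batch_size(global_batch_size: int, world_size: int) -> int:
    """Per-process batch size that matches a single-process run's
    global batch size when replicated across `world_size` DDP processes."""
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must divide evenly across processes")
    return global_batch_size // world_size


# To match a plain script trained with batch_size=128 on 4 DDP processes,
# each process should use a batch size of 32:
print(per_process_batch_size(128, 4))  # → 32
```

Matching the effective batch size (and scaling the learning rate accordingly, if needed) is usually the first step when comparing plain and DDP trainings.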
Answered By - Ivan