Issue
For our development team, we want to build a central GPU server for deep learning / training tasks (one or more powerful GPUs, instead of multiple workstations with their own GPU for each team member). I guess this is a common setup, but I am not sure how to make this GPU sharing work for multiple team members simultaneously. We work with TensorFlow/Keras and Python scripts.
My question is: what is the typical approach to letting team members train their models on that central server? Just give them SSH access and have them run training directly from the command line? Or set up a JupyterHub server so that our developers can run code from their browser?
My main question: if there is only one GPU, how can we make sure that multiple users cannot run their code (i.e., train their networks) at the same time? Is there a way to submit training jobs to some central server software so that they are executed on the GPU one after the other?
Solution
Even though we no longer need this setup, one option is a workload manager like Slurm. Slurm also supports GPU management through its Generic Resource (GRES) mechanism: jobs that request a GPU are queued and run one after the other whenever the GPU is busy.
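As a minimal sketch, assuming Slurm is already installed and the server's GPU is registered as a GRES resource (the script name, resource limits, and train.py are placeholders, not part of the original answer), each training run would be wrapped in a batch script like this:

    #!/bin/bash
    #SBATCH --job-name=train-model    # name shown in the queue
    #SBATCH --gres=gpu:1              # request one GPU; jobs wait until it is free
    #SBATCH --cpus-per-task=4         # CPU cores for data loading (adjust as needed)
    #SBATCH --mem=16G                 # host RAM for the job (adjust as needed)
    #SBATCH --time=12:00:00           # wall-clock limit, after which the job is killed
    #SBATCH --output=train_%j.log     # stdout/stderr; %j expands to the job ID

    # Run the user's training script. Slurm sets CUDA_VISIBLE_DEVICES so that
    # TensorFlow only sees the GPU allocated to this job.
    python train.py

Team members would then submit their jobs with "sbatch train_job.sh", inspect the queue with "squeue", and cancel a job with "scancel <jobid>". Because every job asks for --gres=gpu:1, Slurm serializes them automatically on a single-GPU machine, which answers the "one after the other" requirement from the question.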
Answered By - Matthias