Issue
I'm trying to run the same PyTorch training script with different arguments (argparse) from another Python script, using os.system() for each run.
Here's what I'm trying to do:
train.py
=> the script that contains the train loop.
runner.py
=> the file that runs the train script in a loop.
# runner.py
import os

for hp in hyperparams:
    # hp must actually be interpolated into the f-string
    os.system(f"CUDA_VISIBLE_DEVICES=1 python train.py --arg1 {hp}")
A few models get trained, but I eventually hit a CUDA out-of-memory error. For instance, with 10 models, the first 8 train successfully and then runs 9 and 10 fail with the CUDA error.
My guess is that GPU memory is not being cleared after each iteration. What can I do to mitigate this?
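As an aside, the same loop can be written with subprocess instead of os.system(), so the hyperparameter is passed as a real argument rather than interpolated into a shell string, and the GPU is pinned via the child's environment. A minimal sketch (the hyperparameter values are hypothetical):

```python
import os
import subprocess
import sys

def make_cmd(hp):
    # Build argv for one run; hp becomes a real argument, so no shell quoting.
    return [sys.executable, "train.py", "--arg1", str(hp)]

def run_all(hyperparams, dry_run=True):
    cmds = [make_cmd(hp) for hp in hyperparams]
    if not dry_run:
        # Pin the GPU through the child's environment, replacing the
        # CUDA_VISIBLE_DEVICES=1 shell prefix.
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": "1"}
        for cmd in cmds:
            subprocess.run(cmd, env=env, check=True)  # raise if a run fails
    return cmds

cmds = run_all([0.1, 0.01])
```

Because each training run is a separate child process, all of its GPU memory is returned to the driver when the process exits.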
Solution
Deleting the model objects after each run and then calling
torch.cuda.empty_cache()
freed the GPU memory and fixed the problem.
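The cleanup pattern above can be sketched as follows. This assumes PyTorch is installed; the tiny model, training steps, and hyperparameter values are placeholders for the real train.py contents:

```python
import gc
import torch  # assumes PyTorch is installed; runs on CPU-only builds too

losses = []
for hp in [0.1, 0.01]:  # hypothetical hyperparameter values
    model = torch.nn.Linear(8, 1)
    opt = torch.optim.SGD(model.parameters(), lr=hp)
    for _ in range(3):  # stand-in for the real train loop
        loss = model(torch.randn(4, 8)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    losses.append(loss.item())
    # Drop the Python references so the tensors become garbage-collectable...
    del model, opt, loss
    gc.collect()
    # ...then release cached allocator blocks back to the driver.
    # (No-op when CUDA was never initialized, e.g. on a CPU-only run.)
    torch.cuda.empty_cache()
```

Note that empty_cache() only releases memory that is no longer referenced, which is why deleting the model (and optimizer, which holds references to the parameters) first is essential.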
Answered By - theairbend3r