Issue
I am tuning hyperparameters with Ray Tune. The model is built with TensorFlow and occupies a large part of the available GPU memory. I noticed that every second trial reports an out-of-memory error. The GPU memory usage graph shows the memory being freed between consecutive trials, and that is exactly the moment at which the OOM error occurs. On smaller models I do not encounter this error, even though the graph looks the same.
How can I deal with this out-of-memory error on every second trial?
Solution
There's actually a utility that helps avoid this. TensorFlow does not always release GPU memory the instant a trial process exits, so the next trial can start before the previous one's memory has been reclaimed; wait_for_gpu blocks the trial function until GPU memory utilization drops back down:
https://docs.ray.io/en/master/tune/api_docs/trainable.html#ray.tune.utils.wait_for_gpu
from ray import tune

def tune_func(config):
    # Block until the previous trial's GPU memory has actually been freed.
    tune.utils.wait_for_gpu()
    train(config)  # your training function

tune.run(tune_func, resources_per_trial={"gpu": 1}, num_samples=10)
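If the default threshold is too strict for your setup, the utility accepts keyword arguments to control the wait. Here is a minimal sketch, assuming the target_util and retry parameters documented for recent Ray versions (note that wait_for_gpu relies on the GPUtil package being installed):

def tune_func(config):
    # Wait until GPU memory utilization drops below 5%, polling up to
    # 30 times before giving up and raising an error.
    tune.utils.wait_for_gpu(target_util=0.05, retry=30)
    train(config)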
Answered By - richliaw