Issue
I am trying to run a TensorFlow project on our university HPC cluster and I am running into memory problems. I have to run a prediction job for hundreds of inputs of differing lengths. Our GPU nodes have different amounts of video memory, so I am trying to set up the scripts in a way that will not crash for any combination of GPU node and input length.
After searching the net for solutions, I experimented with TF_FORCE_UNIFIED_MEMORY, XLA_PYTHON_CLIENT_MEM_FRACTION, XLA_PYTHON_CLIENT_PREALLOCATE, and TF_FORCE_GPU_ALLOW_GROWTH, as well as with TensorFlow's set_memory_growth. As I understand it, with unified memory I should be able to use more memory than the GPU physically has.
This was my final setup (only the relevant parts):
import os

os.environ['TF_FORCE_UNIFIED_MEMORY'] = '1'
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '2.0'
# os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = 'false'
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'  # as I understood, this is redundant with the set_memory_growth part :)

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            print(gpu)
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
I submit it on the cluster (Slurm job scheduler) with --mem=30G and --gres=gpu:1.
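For reference, a rough sketch of the batch script I use to submit it (the job name, walltime and environment path are placeholders, not my exact setup):
#!/bin/bash
#SBATCH --job-name=predict      # placeholder job name
#SBATCH --mem=30G               # host RAM for the job
#SBATCH --gres=gpu:1            # request one GPU of whatever type the node has
#SBATCH --time=04:00:00         # placeholder walltime

# activate the project environment (path is a placeholder)
source env/bin/activate

python scripts/script.py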
This is the error my code crashes with. As far as I can tell, it does try to use unified memory but fails for some reason:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5582 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:02:00.0, compute capability: 3.5)
2021-08-24 09:22:02.053935: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 12758286336 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:03.738635: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 11482457088 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:05.418059: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 10334211072 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:07.102411: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 9300789248 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:08.784349: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 8370710016 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:10.468644: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 7533638656 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:12.150588: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 6780274688 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:23:10.326528: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.33GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Traceback (most recent call last):
File "scripts/script.py", line 654, in <module>
prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
File "env/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 183, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "env/lib/python3.7/site-packages/jax/_src/api.py", line 402, in cache_miss
donated_invars=donated_invars, inline=inline)
File "env/lib/python3.7/site-packages/jax/core.py", line 1561, in bind
return call_bind(self, fun, *args, **params)
File "env/lib/python3.7/site-packages/jax/core.py", line 1552, in call_bind
outs = primitive.process(top_trace, fun, tracers, params)
File "env/lib/python3.7/site-packages/jax/core.py", line 1564, in process
return trace.process_call(self, fun, tracers, params)
File "env/lib/python3.7/site-packages/jax/core.py", line 607, in process_call
return primitive.impl(f, *tracers, **params)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 608, in _xla_call_impl
*unsafe_map(arg_spec, args))
File "env/lib/python3.7/site-packages/jax/linear_util.py", line 262, in memoized_fun
ans = call(fun, *args)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 758, in _xla_callable
compiled = compile_or_get_cached(backend, built, options)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 76, in compile_or_get_cached
return backend_compile(backend, computation, compile_options)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
return backend.compile(built_c, compile_options=options)
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "scripts/script.py", line 654, in <module>
prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
return backend.compile(built_c, compile_options=options)
RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.
I would be glad for any ideas on how to get it to work and use all the available memory.
Thank you!
Solution
It looks like your GPU does not fully support unified memory. On this hardware the support is limited: the GPU still has to hold all of the data in its own memory.
See this article for the description: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
In particular:
On systems with pre-Pascal GPUs like the Tesla K80, calling cudaMallocManaged() allocates size bytes of managed memory on the GPU device that is active when the call is made. Internally, the driver also sets up page table entries for all pages covered by the allocation, so that the system knows that the pages are resident on that GPU.
And:
Since these older GPUs can’t page fault, all data must be resident on the GPU just in case the kernel accesses it (even if it won’t).
And your GPU is Kepler-based, according to TechPowerUp: https://www.techpowerup.com/gpu-specs/geforce-gtx-titan-black.c2549
As far as I know, TensorFlow should also issue a warning about that. Something like:
Unified memory on GPUs with compute capability lower than 6.0 (pre-Pascal class GPUs) does not support oversubscription.
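If you want to detect this from the script itself, here is a minimal sketch (assuming TensorFlow 2.3+, where tf.config.experimental.get_device_details is available) that checks the compute capability before relying on oversubscription:
import tensorflow as tf

for gpu in tf.config.list_physical_devices('GPU'):
    details = tf.config.experimental.get_device_details(gpu)
    # 'compute_capability' is reported as a tuple, e.g. (3, 5) for the TITAN Black
    cc = details.get('compute_capability')
    if cc is not None and cc < (6, 0):
        print(f"{details.get('device_name', gpu.name)}: compute capability {cc} is "
              "pre-Pascal, so unified memory cannot oversubscribe GPU memory")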
Answered By - dm0_