Issue
Win 10 64-bit 21H1; TF2.5, CUDA 11 installed in environment (Python 3.9.5 Xeus)
I am not the only one seeing this error; see also (unanswered) here and here. The issue is obscure and the proposed resolutions are unclear/don't seem to work (see e.g. here)
Issue: Using the TF Linear_Mixed_Effects_Models.ipynb example (downloaded from the TensorFlow GitHub here), execution reaches the "warm up stage" and then throws the error:
InternalError: libdevice not found at ./libdevice.10.bc [Op:__inference_one_e_step_2806]
The console output shows that TensorFlow finds the GPU, but XLA initialisation fails to find the (existing!) libdevice in the specified paths:
2021-08-01 22:04:36.691300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9623 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2021-08-01 22:04:37.080007: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
2021-08-01 22:04:54.122528: I tensorflow/compiler/xla/service/service.cc:169] XLA service 0x1d724940130 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-08-01 22:04:54.127766: I tensorflow/compiler/xla/service/service.cc:177] StreamExecutor device (0): NVIDIA GeForce GTX 1080 Ti, Compute Capability 6.1
2021-08-01 22:04:54.215072: W tensorflow/compiler/tf2xla/kernels/random_ops.cc:241] Warning: Using tf.random.uniform with XLA compilation will ignore seeds; consider using tf.random.stateless_uniform instead if reproducible behavior is desired.
2021-08-01 22:04:55.506464: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:73] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
2021-08-01 22:04:55.512876: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:74] Searched for CUDA in the following directories:
2021-08-01 22:04:55.517387: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:77] C:/Users/Julian/anaconda3/envs/TF250_PY395_xeus/Library/bin
2021-08-01 22:04:55.520773: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:77] C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.2
2021-08-01 22:04:55.524125: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:77] .
2021-08-01 22:04:55.526349: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:79] You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
Now the interesting thing is that the paths searched include "C:/Users/Julian/anaconda3/envs/TF250_PY395_xeus/Library/bin".
That folder contains all the DLLs that were loaded successfully at TF startup, including cudart64_110.dll, cudnn64_8.dll... and of course libdevice.10.bc
Question: Since TF says it is searching this location for this file and the file exists there, what is wrong and how do I fix it?
(NB C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.2
does not exist... CUDA is installed in the environment; this path must be a best guess for an OS installation)
Info: I am setting the path by
import os
aPath = '--xla_gpu_cuda_data_dir=C:/Users/Julian/anaconda3/envs/TF250_PY395_xeus/Library/bin'
print(aPath)
os.environ['XLA_FLAGS'] = aPath
but I have also set an OS environment variable XLA_FLAGS to the same string value... I don't know which one is actually working yet, but the fact that the console output says it searched the intended path is good enough
Solution
The diagnostic information is unclear and thus unhelpful; there is however a resolution
The issue was resolved by providing the file (as a copy) at this path
C:\Users\Julian\anaconda3\envs\TF250_PY395_xeus\Library\bin\nvvm\libdevice\
Note that C:\Users\Julian\anaconda3\envs\TF250_PY395_xeus\Library\bin
was the path given to XLA_FLAGS, but it seems XLA does not look for the libdevice file there; it looks in the \nvvm\libdevice\ subfolder of that path. This means that I can't just set a different value in XLA_FLAGS to point to the actual location of the libdevice file because, to coin a phrase, it's not (just) the file it's looking for.
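The manual copy described above could be scripted roughly as follows. This is only a sketch: the function name provide_libdevice is hypothetical, and it assumes libdevice.10.bc already sits directly in the folder passed to XLA_FLAGS, as it did in my environment.

```python
import shutil
from pathlib import Path


def provide_libdevice(cuda_data_dir, libdevice_name="libdevice.10.bc"):
    """Copy <cuda_data_dir>/<libdevice_name> into <cuda_data_dir>/nvvm/libdevice/.

    Hypothetical helper: it assumes the file already exists directly in
    cuda_data_dir and that XLA wants it under nvvm/libdevice below that folder.
    """
    root = Path(cuda_data_dir)
    source = root / libdevice_name
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found")

    # Create the nvvm/libdevice subfolder if it is missing.
    target_dir = root / "nvvm" / "libdevice"
    target_dir.mkdir(parents=True, exist_ok=True)

    # Leave an existing copy untouched so the call is safe to repeat.
    target = target_dir / libdevice_name
    if not target.exists():
        shutil.copy2(source, target)
    return target


# Example call (path taken from this post; adjust for your own environment):
# provide_libdevice(r"C:\Users\Julian\anaconda3\envs\TF250_PY395_xeus\Library\bin")
```

The original file is left in place, so nothing else that might rely on it in Library\bin is affected.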
The debug info earlier:
2021-08-05 08:38:52.889213: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:73] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
2021-08-05 08:38:52.896033: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:74] Searched for CUDA in the following directories:
2021-08-05 08:38:52.899128: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:77] C:/Users/Julian/anaconda3/envs/TF250_PY395_xeus/Library/bin
2021-08-05 08:38:52.902510: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:77] C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.2
2021-08-05 08:38:52.905815: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:77] .
is incorrect insofar as there is no "CUDA" in the search path; and FWIW I think a different error should have been given for searching in C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.2
since there is no such folder (there's an old V10.0 folder there, but no OS install of CUDA 11)
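Since the warning names ${CUDA_DIR}/nvvm/libdevice, a quick way to see which of the listed directories would actually satisfy XLA is to test each one for that subfolder and file. The sketch below does this; the helper names are hypothetical, and the assumption that XLA needs exactly nvvm/libdevice/libdevice.10.bc under the directory is inferred from the log text and from the fix above, not from the TensorFlow source.

```python
from pathlib import Path


def has_libdevice_layout(cuda_dir, libdevice_name="libdevice.10.bc"):
    """Return True if <cuda_dir>/nvvm/libdevice/<libdevice_name> is a file.

    Assumption: this is the layout XLA expects below each searched directory.
    """
    return (Path(cuda_dir) / "nvvm" / "libdevice" / libdevice_name).is_file()


def report_candidates(candidates):
    """Map each candidate directory to whether it has the assumed layout."""
    return {str(c): has_libdevice_layout(c) for c in candidates}


# Example call using the directories from the log above:
# report_candidates([
#     "C:/Users/Julian/anaconda3/envs/TF250_PY395_xeus/Library/bin",
#     "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.2",
#     ".",
# ])
```

A directory that does not exist at all simply reports False, which is also what happens for a directory that holds libdevice.10.bc only at its top level.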
Until/unless path handling is improved by TensorFlow, this kind of file structure manipulation is needed in every new (Anaconda) Python environment.
Full thread in TensorFlow forum here
Answered By - Julian Moore