Issue
I have installed TensorFlow 2.12, CUDA Toolkit 11.8.0 and cuDNN 8.6.0.163 in a Miniconda environment (Python 3.9.16) on Windows 10 with WSL2 (Ubuntu 22.04), following the official tensorflow.org instructions. I should emphasize that I want to use TensorFlow 2.12 because, with the corresponding CUDA Toolkit 11.8.0, it is compatible with Ada Lovelace GPUs (an RTX 4080 in my case).
When I go to train my model, it gives me the following error:
"Loaded cuDNN version 8600 Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so : cannot open shared object file: No such file or directory".
Does anyone have an idea what is going wrong*?
The paths were configured as follows:
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
I searched for the files named in the error using the following commands:
ldconfig -p | grep libcudnn_cnn
which returned nothing, so the file is not in the dynamic linker cache, and
ldconfig -p | grep libcuda
which returned
libcuda.so.1 (libc6,x86-64) => /usr/lib/wsl/lib/libcuda.so.1
I also tried setting the following environment variable and adding it to $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh, but without any luck:
export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
*Note that when importing TensorFlow, I get the following warnings:
TF-TRT Warning: Could not find TensorRT
could not open file to read NUMA node: /sys/bus/pci/devices/0000:1c:00.0/numa_node Your kernel may have been built without NUMA support.
In addition, I tried following the NVIDIA documentation for WSL, specifically section 3 -> Option 1, but this did not solve the problem.
Solution
Ran into this problem and found a working solution after a lot of digging around.
First, the missing libcuda.so can be solved by the method proposed here: https://github.com/microsoft/WSL/issues/5663#issuecomment-1068499676
Essentially, this means rebuilding the symbolic links in the CUDA lib directory:
> cd \Windows\System32\lxss\lib
> del libcuda.so
> del libcuda.so.1
> mklink libcuda.so libcuda.so.1.1
> mklink libcuda.so.1 libcuda.so.1.1
(this is done in an elevated, i.e. administrator, Command Prompt)
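To confirm from inside WSL that the links now resolve, a small check like the one below may help. It is only a sketch: check_libcuda is a hypothetical helper name, and the default directory /usr/lib/wsl/lib is an assumption based on the path that ldconfig reported in the question.

```shell
# Hypothetical helper: report whether each expected libcuda name
# resolves to a real file. The default directory is an assumption
# taken from the ldconfig output quoted in the question.
check_libcuda() {
    dir="${1:-/usr/lib/wsl/lib}"
    for name in libcuda.so libcuda.so.1; do
        # -e follows symlinks, so a dangling link counts as missing.
        if [ -e "$dir/$name" ]; then
            echo "$name: ok -> $(readlink -f "$dir/$name")"
        else
            echo "$name: missing or dangling"
        fi
    done
}

check_libcuda
```

If libcuda.so is still reported as missing or dangling, the links were probably not recreated in the directory that WSL actually mounts.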
Then, when you run into the missing libdevice problem (which you undoubtedly will), solve it as described here: https://github.com/tensorflow/tensorflow/issues/58681#issuecomment-1406967453
Which boils down to:
$ mkdir -p $CONDA_PREFIX/lib/nvvm/libdevice/
$ cp -p $CONDA_PREFIX/lib/libdevice.10.bc $CONDA_PREFIX/lib/nvvm/libdevice/
$ export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib
And
$ conda install -c nvidia cuda-nvcc --yes
(verify with ptxas --version)
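The libdevice steps above can be wrapped in a function that stops early when the source file is absent, which avoids silently exporting a flag that points at an empty directory. This is a minimal sketch, not part of the linked fix: setup_libdevice is a hypothetical name, and it assumes libdevice.10.bc sits under $CONDA_PREFIX/lib as in the commands above.

```shell
# Hypothetical wrapper around the libdevice steps shown above.
# Assumes libdevice.10.bc lives in <prefix>/lib.
setup_libdevice() {
    prefix="$1"
    src="$prefix/lib/libdevice.10.bc"
    dest="$prefix/lib/nvvm/libdevice"
    if [ ! -f "$src" ]; then
        echo "libdevice.10.bc not found under $prefix/lib" >&2
        return 1
    fi
    mkdir -p "$dest"
    cp -p "$src" "$dest/"
    # Point XLA at the directory that now contains nvvm/libdevice.
    export XLA_FLAGS="--xla_gpu_cuda_data_dir=$prefix/lib"
}

# Only act when a conda environment is active.
if [ -n "${CONDA_PREFIX:-}" ]; then
    setup_libdevice "$CONDA_PREFIX" || echo "libdevice setup skipped" >&2
fi
```

Because mkdir -p and cp -p are safe to repeat, the function can be run again after reinstalling packages in the environment.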
If you're running notebooks in VS Code Remote WSL, you'll need to add
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib
to $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
(this is good practice anyway)
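Since the activation script is usually built up with echo ... >>, running the setup twice duplicates lines. One possible way around that is sketched below; append_once is a hypothetical helper, not a conda feature. The single quotes keep $CONDA_PREFIX literal in the file so that it is expanded when the environment is activated.

```shell
# Hypothetical helper: append a line to a file only if that exact
# line is not already present.
append_once() {
    file="$1"
    line="$2"
    mkdir -p "$(dirname "$file")"
    touch "$file"
    # -x matches whole lines, -F treats the line as a fixed string.
    grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Only act when a conda environment is active.
if [ -n "${CONDA_PREFIX:-}" ]; then
    append_once "$CONDA_PREFIX/etc/conda/activate.d/env_vars.sh" \
        'export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib'
fi
```

Deactivate and reactivate the environment afterwards so the script is sourced again.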
Answered By - Roy Shilkrot