Issue
I'm using an Ubuntu 18.04 Docker container.
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.4 LTS"
When I try to train a pretrained resnext101_32x8d model from torchvision, the checkpoint download fails with the following error:
Downloading: "https://download.pytorch.org/models/resnext101_32x8d-8ba56ff5.pth" to /home/vmuser/.cache/torch/hub/checkpoints/resnext101_32x8d-8ba56ff5.pth
0%| | 0.00/340M [00:00<?, ?B/s]
Traceback (most recent call last):
  File "train_attn_best_config.py", line 377, in <module>
    tabct = TabCT(cnn = model, fc_dim = fd, attn_filters = af, n_attn_layers = nal).to(gpu)
  File "train_attn_best_config.py", line 219, in __init__
    self.ct_cnn = cnn_dict[cnn](pretrained = True)
  File "/home/vmuser/anaconda3/envs/pulmo/lib/python3.7/site-packages/torchvision/models/resnet.py", line 317, in resnext101_32x8d
    pretrained, progress, **kwargs)
  File "/home/vmuser/anaconda3/envs/pulmo/lib/python3.7/site-packages/torchvision/models/resnet.py", line 227, in _resnet
    progress=progress)
  File "/home/vmuser/anaconda3/envs/pulmo/lib/python3.7/site-packages/torch/hub.py", line 481, in load_state_dict_from_url
    download_url_to_file(url, cached_file, hash_prefix, progress=progress)
  File "/home/vmuser/anaconda3/envs/pulmo/lib/python3.7/site-packages/torch/hub.py", line 404, in download_url_to_file
    f.write(buffer)
  File "/home/vmuser/anaconda3/envs/pulmo/lib/python3.7/tempfile.py", line 481, in func_wrapper
    return func(*args, **kwargs)
OSError: [Errno 28] No space left on device
I tried running export TMPDIR=/var/tmp and export TMPDIR=~/Data/tmp. When I run df, I get the output below; one of my tmpfs mounts is only 64 MB (65536 1K-blocks).
$ df
Filesystem  1K-blocks       Used Available Use% Mounted on
overlay    1797272568 1705953392         0 100% /
tmpfs           65536          0     65536   0% /dev
tmpfs        98346264          0  98346264   0% /sys/fs/cgroup
/dev/sda6  1797272568 1705953392         0 100% /etc/hosts
shm             65536          0     65536   0% /dev/shm
/dev/sdb1  1845816492 1362932848 389098592  78% /home/vmuser/Data
tmpfs        98346264         12  98346252   1% /proc/driver/nvidia
tmpfs        19669256      93256  19576000   1% /run/nvidia-persistenced/socket
udev         98318592          0  98318592   0% /dev/nvidia1
tmpfs        98346264          0  98346264   0% /proc/acpi
tmpfs        98346264          0  98346264   0% /proc/scsi
tmpfs        98346264          0  98346264   0% /sys/firmware
Neither TMPDIR setting makes the error go away.
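The df listing can be cross-checked from inside the Python environment that runs the training script. The following is only a minimal sketch: the helper names free_mib and nearest_existing are made up for illustration, and the paths are assumptions taken from the traceback and the TMPDIR attempts above.

```python
import os
import shutil
import tempfile


def nearest_existing(path):
    """Walk up from `path` until a directory that actually exists is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return path


def free_mib(path):
    """Return free space in MiB on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / (1024 * 1024)


if __name__ == "__main__":
    # Assumed paths of interest: the root overlay, the current temp directory
    # (affected by TMPDIR), and the torch hub cache shown in the traceback.
    candidates = [
        "/",
        tempfile.gettempdir(),
        os.path.expanduser("~/.cache/torch/hub/checkpoints"),
    ]
    for path in candidates:
        existing = nearest_existing(path)
        print(f"{path} (checked at {existing}): {free_mib(existing):.0f} MiB free")
```

Run in the same shell and conda environment as the training script, this prints one line per path, which shows which filesystem each location actually lives on.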
Solution
This looks like a shared-memory (shm) issue.
Try starting the Docker container with the --ipc=host flag.
For more details, see this thread.
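To see whether the flag took effect, the size of /dev/shm can be inspected from inside the container. This is a rough sketch, not part of the original answer: the helper shm_looks_default is a hypothetical name, and the 64 MiB threshold assumes Docker's default shared-memory size, which matches the 65536 1K-blocks reported by df in the question.

```python
import os
import shutil

# Assumption: Docker's default /dev/shm size is 64 MiB
# (65536 1K-blocks, as in the df output from the question).
DOCKER_DEFAULT_SHM_BYTES = 64 * 1024 * 1024


def shm_looks_default(total_bytes, default_bytes=DOCKER_DEFAULT_SHM_BYTES):
    """Return True if the shared-memory mount is no larger than the default."""
    return total_bytes <= default_bytes


if __name__ == "__main__":
    shm_path = "/dev/shm"
    if os.path.isdir(shm_path):
        total = shutil.disk_usage(shm_path).total
        print(f"{shm_path}: {total / 2**20:.0f} MiB total")
        if shm_looks_default(total):
            print("Still at the 64 MiB default; --ipc=host may not be in effect.")
    else:
        print(f"{shm_path} is not available on this system")
```

With --ipc=host the container shares the host's IPC namespace, so the reported size should typically be the host's rather than 64 MiB.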
Answered By - Shai