Issue
I have created a Dataset object that loads some data from an API when loading an item
import torch
import requests
from torch.utils.data import Dataset

session = requests.Session()  # module-level session shared by the dataset

class MyDataset(Dataset):
    def __init__(self, obj_ids=None):
        super().__init__()
        self.obj_ids = obj_ids if obj_ids is not None else []

    def __len__(self):
        return len(self.obj_ids)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        result = session.get('/api/url/{}'.format(idx))
        ## Post processing work...
Then I pass it to my DataLoader:
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True, num_workers=1,
    collate_fn=utils.collate_fn)
Everything works fine when training with num_workers=1, but when I increase it to 2 or greater I get an error in my training loop, on this line:
train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
SSLError: Caught SSLError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.7/http/client.py", line 1373, in getresponse
response.begin()
File "/usr/lib/python3.7/http/client.py", line 319, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.7/http/client.py", line 280, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.7/ssl.py", line 1071, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.7/ssl.py", line 929, in read
return self._sslobj.read(len, buffer)
ssl.SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2570)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.7/dist-packages/urllib3/util/retry.py", line 399, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='mydomain.com', port=443): Max retries exceeded with url: 'url_with_error_is_here' (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2570)')))
If I remove the request, I stop getting the SSL error, so the problem must be something in the requests library or perhaps urllib.
I changed the domain and URL in the error to dummy values, but both URLs and domains work fine with just 1 worker.
I'm running this in a Google Colab environment with GPU enabled, but I also tried it on my local machine and got the same problem.
Can anyone help me to solve this issue?
Solution
After debugging a bit and reading more about multiprocessing and requests.Session, it seems the problem is that I cannot use a requests.Session inside a Dataset, because PyTorch eventually uses multiprocessing in the training loop, and the session's underlying SSL connection ends up shared across worker processes.
More about it in this question: How to assign python requests sessions for single processes in multiprocessing pool?
The issue is fixed by changing every session.get or session.post to requests.get or requests.post. Without a Session, each call opens its own connection, so the worker processes no longer share the same SSL connection and the SSLError goes away.
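A minimal sketch of the fixed __getitem__, assuming the API URL and post-processing from the question are placeholders:

```python
import torch
import requests
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, obj_ids=None):
        super().__init__()
        self.obj_ids = obj_ids if obj_ids is not None else []

    def __len__(self):
        return len(self.obj_ids)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        # Plain requests.get opens a fresh connection per call, so no
        # SSL connection is shared across DataLoader worker processes.
        result = requests.get('https://mydomain.com/api/url/{}'.format(idx))
        # ... post-processing of `result` goes here ...
        return result
```

The trade-off is that you lose connection pooling, so each item load pays the connection-setup cost; in exchange, the dataset becomes safe to use with num_workers greater than 1.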
Answered By - Pablo Estrada