Monday, November 27, 2023

[FIXED] Pytorch: RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered

November 27, 2023 deep-learning, machine-learning, python, pytorch No comments

Issue

I am running into the following error when trying to train this on this dataset.

Since this is the configuration published in the paper, I am assuming I am doing something incredibly wrong.

This error arrives on a different image every time I try to run training.

C:/w/1/s/windows/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\pydevd.py", line 1741, in <module>
    main()
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\pydevd.py", line 1735, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\pydevd.py", line 1135, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Noam/Code/vision_course/hopenet/deep-head-pose/code/original_code_augmented/train_hopenet_with_validation_holdout.py", line 187, in <module>
    loss_reg_yaw = reg_criterion(yaw_predicted, label_yaw_cont)
  File "C:\Noam\Code\vision_course\hopenet\venv\lib\site-packages\torch\nn\modules\module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Noam\Code\vision_course\hopenet\venv\lib\site-packages\torch\nn\modules\loss.py", line 431, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
  File "C:\Noam\Code\vision_course\hopenet\venv\lib\site-packages\torch\nn\functional.py", line 2204, in mse_loss
    ret = torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered

Any ideas?

Solution

This kind of error generally occurs when using NLLLoss or CrossEntropyLoss, and when your dataset has negative labels (or labels greater than the number of classes). That is also the exact error you are getting Assertion t >= 0 && t < n_classes failed.

This won't occur for MSELoss, but OP mentions that there is a CrossEntropyLoss somewhere and thus the error occurs (the program crashes asynchronously on some other line). The solution is to clean the dataset and ensure that t >= 0 && t < n_classes is satisfied (where t represents the label).

Also, ensure that your network output is in the range 0 to 1 in case you use NLLLoss or BCELoss (then you require softmax or sigmoid activation respectively). Note that this is not required for CrossEntropyLoss or BCEWithLogitsLoss because they implement the activation function inside the loss function. (Thanks to @PouyaB for pointing out).

Answered By - akshayk07

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, November 27, 2023

[FIXED] Pytorch: RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels