61

I have access to a Tesla K20c, and I am running ResNet50 on the CIFAR10 dataset... Then I get the following error:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu line=265 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 109, in <module>
    train(loader_train, model, criterion, optimizer)
  File "main.py", line 54, in train
    optimizer.step()
  File "/usr/local/anaconda35/lib/python3.6/site-packages/torch/optim/sgd.py", line 93, in step
    d_p.add_(weight_decay, p.data)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:265

How do I resolve this error?

saichand
  • Try running your script with `CUDA_LAUNCH_BLOCKING=1 python your_script.py` to get a more accurate stack trace. – McLawrence Aug 05 '18 at 07:16
  • After running with CUDA_LAUNC...=1, I get the error as `/opt/conda/.../THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed.` This comes up around 20 times, then the traceback follows: `RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:116`. How do I resolve this? – saichand Aug 05 '18 at 08:00
  • This is an error with your target labels: `t >= 0 && t < n_classes`. Print your labels and make sure that they are non-negative and smaller than the number of outputs of your last layer. – McLawrence Aug 05 '18 at 08:04
  • n_classes should be the same as the number of outputs of the last layer. Is that right? – saichand Aug 05 '18 at 08:11
  • That's right. Your targets likely take values that are too high. – McLawrence Aug 05 '18 at 08:16
  • @McLawrence, my error points me to `return self.apply(lambda x: x.to(device), *keys)`. But if I don't use the **to(device)** option, it shows a device mismatch error between CUDA (required for x) and CPU (where x actually is in this case). – Kanishk Mair Feb 28 '20 at 05:40

9 Answers

85

I have encountered this problem several times, and I find it to be an indexing issue.

For example, if your ground-truth labels start at 1, e.g. target = [1, 2, 3, 4, 5], then you should subtract 1 from every label so that it becomes [0, 1, 2, 3, 4].

This solves my problem every time.
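
A minimal sketch of that remapping, assuming the labels sit in an integer tensor (the variable name here is only illustrative):

import torch

targets = torch.tensor([1, 2, 3, 4, 5])   # hypothetical labels that start at 1

# Shift to 0-based class indices, as losses like nn.CrossEntropyLoss expect
targets = targets - 1
print(targets)                            # tensor([0, 1, 2, 3, 4])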

Rainy
    I can confirm, this was also the cause of error in my case. For example, valid text labels have been converted to 0..n-1 (n being the number of classes). However, NaN values were converted to -1, which sent it off the rails. – Christian Mar 21 '19 at 01:13
  • @Rainy can you elaborate on "ground truth label starts at 1"? What do you mean by that? I gather that the labels are 1 to 5 and, to overcome the error, the first label value should be zero. Am I right? – Kunj Mehta Oct 02 '19 at 14:55
  • @KunjMehta, not just the first value should be zero; the class indices should start from zero. E.g. for 6 classes, index values should run from 0 to 5. – Chandra Jan 20 '20 at 04:22
  • I get the error even though I have the setup you offer – Nihat Nov 20 '20 at 14:25
  • saved my day! Thank you. – Oras Jan 17 '22 at 14:24
59

In general, when encountering CUDA runtime errors, it is advisable to run your program again with the environment variable CUDA_LAUNCH_BLOCKING=1 set, to obtain an accurate stack trace.

In your specific case, the targets of your data were too high (or low) for the specified number of classes.
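
As a quick sanity check along those lines, here is a sketch assuming the integer targets live in a tensor called targets and the last layer has num_classes outputs (both names are illustrative):

# Run the training script itself with: CUDA_LAUNCH_BLOCKING=1 python main.py
import torch

num_classes = 10                          # e.g. CIFAR10
targets = torch.tensor([0, 5, 9])         # hypothetical batch of labels

# nn.CrossEntropyLoss / nn.NLLLoss expect targets in [0, num_classes - 1]
print(targets.min().item(), targets.max().item())
assert targets.min().item() >= 0 and targets.max().item() < num_classes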

McLawrence
  • To add to this, once you get a more accurate stack trace and locate where the issue is, you can move your tensors to CPU. Moving the tensors to CPU will give much more detailed errors. Combining `CUDA_LAUNCH_BLOCKING=1` with moving the tensors to CPU was the only way I was able to solve a problem I spent 3 days on. – Eric Wiener Nov 05 '20 at 01:44
10

I encountered this error when running BertModel.from_pretrained('bert-base-uncased'). When I moved to the CPU, the error message changed to 'IndexError: index out of range in self', which led me to this post. The solution was to truncate sentences to length 512.
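
A minimal sketch of that truncation, assuming the inputs are prepared with the Hugging Face transformers tokenizer (the input text is a placeholder):

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# BERT's position embeddings cover at most 512 tokens, so truncate longer inputs
inputs = tokenizer("a very long document ...", truncation=True, max_length=512,
                   return_tensors="pt")
outputs = model(**inputs)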

R Tiffin
5

One way to raise the "CUDA error: device-side assert triggered" RuntimeError is by indexing into a GPU torch.Tensor using a list that contains out-of-range indices.

So, this snippet would raise an IndexError with the message "IndexError: index 3 is out of bounds for dimension 0 with size 3", not the CUDA error:

data = torch.randn((3,10), device=torch.device("cuda"))
data[3,:]

whereas this one would raise the CUDA "device-side assert triggered" RuntimeError:

data = torch.randn((3,10), device=torch.device("cuda"))
indices = [1,3]
data[indices,:]

In the case of class labels, such as in the answer by @Rainy, this means it is the final class label (i.e. when label == num_classes) that causes the error when the labels start from 1 rather than 0.

Also, when the device is "cpu", the error thrown is an IndexError, like the one thrown by the first snippet.
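
For instance, the same out-of-range list index on a CPU tensor fails immediately with a readable error (a small sketch mirroring the snippets above):

import torch

data = torch.randn((3, 10), device=torch.device("cpu"))
indices = [1, 3]
data[indices, :]   # IndexError: index 3 is out of bounds for dimension 0 with size 3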

Alan
2

This error can be made more informative if you switch to the CPU first. Once you switch to the CPU, it will show the exact error, which is most probably related to an indexing problem; in my case it was IndexError: Target 2 is out of bounds, and it could be similar in yours. The question to ask is: how many classes are you currently using, and what is the shape of your output? You can find the range of the classes like this:

max(train_labels)
min(train_labels)

which in my case gave me 2 and 0. The problem was caused by a missing label index (the labels were 0 and 2, with no 1), so a quick hack is to replace all 2s with 1s, which can be done with this code:

train_ = train.copy()
train_['label'] = train_['label'].replace(2, 1)

Then run the same code and check the results; it should work:

# A simple torch Dataset wrapping the tokenizer encodings and integer labels
class NDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NDataset(train_encodings, train_labels)
val_dataset = NDataset(val_encodings, val_labels)
test_dataset = NDataset(test_encodings, test_labels)
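
To see the more descriptive CPU error mentioned at the start of this answer without touching your real model, here is a self-contained toy sketch (the linear layer merely stands in for your classifier):

import torch
import torch.nn as nn

# Toy classifier with only 2 output classes
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(3, 4)
labels = torch.tensor([0, 2, 1])          # label 2 is invalid when there are 2 classes

loss = criterion(model(inputs), labels)   # on CPU: IndexError: Target 2 is out of bounds.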
Shaina Raza
2

I found I got this error when I had a label with an invalid value.

arame3333
0

This happened to me multiple times when the target or label of the BCE or CE loss was negative or otherwise outside the valid range.
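
A quick sketch of a range check for this, with an illustrative targets tensor:

import torch

targets = torch.tensor([1., 0., -1.])     # hypothetical BCE targets; -1 is invalid

# nn.BCELoss / nn.BCEWithLogitsLoss expect targets in [0, 1];
# nn.CrossEntropyLoss expects integer class indices in [0, num_classes - 1]
assert ((targets >= 0) & (targets <= 1)).all(), "targets out of range for BCE"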

Valentin
0

This can also be caused by NaN values in your model's input data. One easy way to "treat" this problem is to convert any NaNs that pop up into zeros on the fly:

batch_data[batch_data != batch_data] = 0  # NaN != NaN, so this mask selects and zeroes the NaN entries
0

Another situation where this can happen: you are training on a dataset with more classes than your last layer expects. It's another unexpected-index situation.
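
As a sketch of how to check for that mismatch, assuming a torchvision ResNet50 (as in the question) and an integer label tensor called labels (an illustrative name):

import torch
from torchvision import models

model = models.resnet50(num_classes=10)   # CIFAR10 has 10 classes
labels = torch.tensor([0, 3, 9])          # hypothetical batch of labels

num_outputs = model.fc.out_features       # size of the final layer
# Every label must satisfy 0 <= label < num_outputs
assert int(labels.min()) >= 0 and int(labels.max()) < num_outputs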