61

I have access to a Tesla K20c, and I am running ResNet50 on the CIFAR10 dataset... Then I get the following error:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu line=265 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 109, in <module>
    train(loader_train, model, criterion, optimizer)
  File "main.py", line 54, in train
    optimizer.step()
  File "/usr/local/anaconda35/lib/python3.6/site-packages/torch/optim/sgd.py", line 93, in step
    d_p.add_(weight_decay, p.data)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:265

How do I resolve this error?

saichand
  • Try running your script with `CUDA_LAUNCH_BLOCKING=1 python your_script.py` to get a more accurate stack trace. – McLawrence Aug 05 '18 at 07:16
  • After running with CUDA_LAUNC...=1, I get the error as `/opt/conda/.../THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed.` This comes up around 20 times, then the traceback follows: `RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:116`. How do I resolve this? – saichand Aug 05 '18 at 08:00
  • This is an error with your target labels: `t >= 0 && t < n_classes`. Print your labels and make sure that they are non-negative and smaller than the number of outputs of your last layer. – McLawrence Aug 05 '18 at 08:04
  • n_classes should be the same as the number of outputs of the last layer. Is that right? – saichand Aug 05 '18 at 08:11
  • That's right. Your targets likely take values that are too high. – McLawrence Aug 05 '18 at 08:16
  • @McLawrence, my error points me to `return self.apply(lambda x: x.to(device), *keys)`. But if I don't use the **to(device)** option, it shows a device mismatch error between CUDA (required for x) and CPU (where x actually is in this case). – Kanishk Mair Feb 28 '20 at 05:40

9 Answers

85

I have encountered this problem several times, and I find it to be an indexing issue.

For example, if your ground-truth labels start at 1, e.g. target = [1, 2, 3, 4, 5], then you should subtract 1 from every label so that it becomes [0, 1, 2, 3, 4].

This solves my problem every time.
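
A minimal sketch of that remapping, assuming the labels sit in an integer tensor (the variable name here is only illustrative):

import torch

targets = torch.tensor([1, 2, 3, 4, 5])   # hypothetical labels that start at 1

# Shift to 0-based class indices, as losses like nn.CrossEntropyLoss expect
targets = targets - 1
print(targets)                            # tensor([0, 1, 2, 3, 4])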

Rainy
    I can confirm, this was also the cause of error in my case. For example, valid text labels have been converted to 0..n-1 (n being the number of classes). However, NaN values were converted to -1, which sent it off the rails. – Christian Mar 21 '19 at 01:13
  • @Rainy can you elaborate on "ground truth label starts at 1"? What do you mean by that? I gather that the labels are 1 to 5 and, to overcome the error, the first label value should be zero. Am I right? – Kunj Mehta Oct 02 '19 at 14:55
  • @KunjMehta, not just the first value should be zero; the class indices should start from zero. E.g. for 6 classes, index values should run from 0 to 5. – Chandra Jan 20 '20 at 04:22
  • I get the error even though I have the setup you offer – Nihat Nov 20 '20 at 14:25
  • saved my day! Thank you. – Oras Jan 17 '22 at 14:24
59

In general, when encountering CUDA runtime errors, it is advisable to run your program again with the environment variable CUDA_LAUNCH_BLOCKING=1 set, to obtain an accurate stack trace.

In your specific case, the targets of your data were too high (or low) for the specified number of classes.
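
As a quick sanity check along those lines, here is a sketch assuming the integer targets live in a tensor called targets and the last layer has num_classes outputs (both names are illustrative):

# Run the training script itself with: CUDA_LAUNCH_BLOCKING=1 python main.py
import torch

num_classes = 10                          # e.g. CIFAR10
targets = torch.tensor([0, 5, 9])         # hypothetical batch of labels

# nn.CrossEntropyLoss / nn.NLLLoss expect targets in [0, num_classes - 1]
print(targets.min().item(), targets.max().item())
assert targets.min().item() >= 0 and targets.max().item() < num_classes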

McLawrence
  • To add to this, once you get a more accurate stack trace and locate where the issue is, you can move your tensors to CPU. Moving the tensors to CPU will give much more detailed errors. Combining `CUDA_LAUNCH_BLOCKING=1` with moving the tensors to CPU was the only way I was able to solve a problem I spent 3 days on. – Eric Wiener Nov 05 '20 at 01:44
10

I encountered this error when running BertModel.from_pretrained('bert-base-uncased'). When I moved to the CPU, the error message changed to 'IndexError: index out of range in self', which led me to this post. The solution was to truncate sentences to length 512.
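
A minimal sketch of that truncation, assuming the inputs are prepared with the Hugging Face transformers tokenizer (the input text is a placeholder):

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# BERT's position embeddings cover at most 512 tokens, so truncate longer inputs
inputs = tokenizer("a very long document ...", truncation=True, max_length=512,
                   return_tensors="pt")
outputs = model(**inputs)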

R Tiffin
5

One way to raise the "CUDA error: device-side assert triggered" RuntimeError is by indexing into a GPU torch.Tensor using a list that contains out-of-range indices.

So, this snippet would raise an IndexError with the message "IndexError: index 3 is out of bounds for dimension 0 with size 3", not the CUDA error:

data = torch.randn((3,10), device=torch.device("cuda"))
data[3,:]

whereas this one would raise the CUDA "device-side assert triggered" RuntimeError:

data = torch.randn((3,10), device=torch.device("cuda"))
indices = [1,3]
data[indices,:]

In the case of class labels, such as in the answer by @Rainy, this means it is the final class label (i.e. when label == num_classes) that causes the error when the labels start from 1 rather than 0.

Also, when the device is "cpu", the error thrown is an IndexError, like the one thrown by the first snippet.
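
For instance, the same out-of-range list index on a CPU tensor fails immediately with a readable error (a small sketch mirroring the snippets above):

import torch

data = torch.randn((3, 10), device=torch.device("cpu"))
indices = [1, 3]
data[indices, :]   # IndexError: index 3 is out of bounds for dimension 0 with size 3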

Alan
2

This error can be made more informative if you switch to the CPU first. Once you switch to the CPU, it will show the exact error, which is most probably related to an indexing problem; in my case it was IndexError: Target 2 is out of bounds, and it could be similar in yours. The question to ask is: how many classes are you currently using, and what is the shape of your output? You can find the range of the classes like this:

max(train_labels)
min(train_labels)

which in my case gave me 2 and 0. The problem was caused by a missing label index (the labels were 0 and 2, with no 1), so a quick hack is to replace all 2s with 1s, which can be done with this code:

train_ = train.copy()
train_['label'] = train_['label'].replace(2, 1)

Then run the same code and check the results; it should work:

# A simple torch Dataset wrapping the tokenizer encodings and integer labels
class NDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NDataset(train_encodings, train_labels)
val_dataset = NDataset(val_encodings, val_labels)
test_dataset = NDataset(test_encodings, test_labels)
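
To see the more descriptive CPU error mentioned at the start of this answer without touching your real model, here is a self-contained toy sketch (the linear layer merely stands in for your classifier):

import torch
import torch.nn as nn

# Toy classifier with only 2 output classes
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(3, 4)
labels = torch.tensor([0, 2, 1])          # label 2 is invalid when there are 2 classes

loss = criterion(model(inputs), labels)   # on CPU: IndexError: Target 2 is out of bounds.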
Shaina Raza
2

I found I got this error when I had a label with an invalid value.

arame3333
0

This happened to me multiple times when the target or label of the BCE or CE loss was negative or otherwise outside the valid range.
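
A quick sketch of a range check for this, with an illustrative targets tensor:

import torch

targets = torch.tensor([1., 0., -1.])     # hypothetical BCE targets; -1 is invalid

# nn.BCELoss / nn.BCEWithLogitsLoss expect targets in [0, 1];
# nn.CrossEntropyLoss expects integer class indices in [0, num_classes - 1]
assert ((targets >= 0) & (targets <= 1)).all(), "targets out of range for BCE"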

Valentin
0

This can also be caused by NaN values in your model's input data. One easy way to "treat" this problem is to convert any NaNs that pop up into zeros on the fly:

batch_data[batch_data != batch_data] = 0  # NaN != NaN, so this mask selects and zeroes the NaN entries
0

Another situation where this can happen: you are training on a dataset with more classes than your last layer expects. It's another unexpected-index situation.
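
As a sketch of how to check for that mismatch, assuming a torchvision ResNet50 (as in the question) and an integer label tensor called labels (an illustrative name):

import torch
from torchvision import models

model = models.resnet50(num_classes=10)   # CIFAR10 has 10 classes
labels = torch.tensor([0, 3, 9])          # hypothetical batch of labels

num_outputs = model.fc.out_features       # size of the final layer
# Every label must satisfy 0 <= label < num_outputs
assert int(labels.min()) >= 0 and int(labels.max()) < num_outputs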