Allocate 2D Array on Device Memory in CUDA

Question

How do I allocate 2D arrays in device memory in CUDA, and transfer them to and from the host?

score 19 · Accepted Answer · edited Feb 11 '13 at 19:18

I found a solution to this problem. I didn't have to flatten the array.

The built-in cudaMallocPitch() function did the job, and I could transfer the array to and from the device using cudaMemcpy2D().

For example:

cudaMallocPitch((void**) &array, &pitch, a*sizeof(float), b);

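This allocates a 2D array of b rows, each a floats wide. The pitch (the actual width of each row in bytes, including any alignment padding) is not an input: cudaMallocPitch() writes it to pitch.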
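The transfer step with cudaMemcpy2D() is only mentioned above, so here is a minimal sketch of a full round trip, assuming a small 4x6 matrix. The names kRows, kCols and roundTripMismatches are made up for this example and do not come from the original answer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sizes, chosen only for this sketch.
constexpr int kRows = 4;
constexpr int kCols = 6;

// Copies a host matrix into a pitched device allocation and back again.
// Returns the number of mismatching elements, or -1 on a CUDA error.
int roundTripMismatches()
{
    float hostIn[kRows][kCols];
    float hostOut[kRows][kCols] = {};
    for (int r = 0; r < kRows; ++r)
        for (int c = 0; c < kCols; ++c)
            hostIn[r][c] = static_cast<float>(r * kCols + c);

    float* devPtr = nullptr;
    size_t pitch = 0;  // filled in by cudaMallocPitch, in bytes
    const size_t rowBytes = kCols * sizeof(float);
    if (cudaMallocPitch((void**)&devPtr, &pitch, rowBytes, kRows) != cudaSuccess)
        return -1;

    // Host rows are tightly packed (rowBytes apart); device rows are pitch bytes apart.
    cudaError_t err = cudaMemcpy2D(devPtr, pitch, hostIn, rowBytes,
                                   rowBytes, kRows, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy2D(hostOut, rowBytes, devPtr, pitch,
                           rowBytes, kRows, cudaMemcpyDeviceToHost);
    cudaFree(devPtr);
    if (err != cudaSuccess)
        return -1;

    int mismatches = 0;
    for (int r = 0; r < kRows; ++r)
        for (int c = 0; c < kCols; ++c)
            if (hostOut[r][c] != hostIn[r][c])
                ++mismatches;
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", roundTripMismatches());
    return 0;
}
```

Note that the width argument of cudaMemcpy2D() is given in bytes while the height is given in rows, and that the source and destination each get their own pitch.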

The following code creates a 2D array and loops over its elements. It compiles as is, so feel free to use it.

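#include <stdio.h>
#include <cuda_runtime.h>

#define height 50
#define width 50

// Device code: walk every row of the pitched allocation.
__global__ void kernel(float* devPtr, size_t pitch)
{
    for (int r = 0; r < height; ++r) {
        // Rows are pitch bytes apart, so step through a char* before casting back.
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
            (void)element;  // the value is read only for illustration
        }
    }
}

// Host code
int main()
{
    float* devPtr;
    size_t pitch;
    // Allocate height rows of width floats; the runtime chooses the pitch.
    cudaError_t err = cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocPitch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Every thread runs the same serial loop here; a single thread would be enough.
    kernel<<<100, 512>>>(devPtr, pitch);
    cudaDeviceSynchronize();
    cudaFree(devPtr);
    return 0;
}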

Is it possible to allocate a new row for the array later on? – scatman, Apr 12 '11 at 06:08

score 3 · Answer 2 · answered Jun 26 '09 at 05:50

Flatten it: make it one-dimensional. See how it's done here
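Since the link is not reproduced here, the following is a rough sketch of what flattening usually looks like: one cudaMalloc() of rows * cols floats, with element (r, c) stored at index r * cols + c. The names fillFlat, flatMismatches, kRows and kCols are invented for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sizes, chosen only for this sketch.
constexpr int kRows = 50;
constexpr int kCols = 50;

// One thread per element of the flattened array.
__global__ void fillFlat(float* data, int rows, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rows * cols) return;  // guard the partially filled last block

    int r = i / cols;  // recover the 2D coordinates from the flat index
    int c = i % cols;
    data[r * cols + c] = 100.0f * r + c;
}

// Fills the array on the device, copies it back, and checks every element.
// Returns the number of mismatching elements, or -1 on a CUDA error.
int flatMismatches()
{
    static float host[kRows * kCols];
    const size_t bytes = sizeof(host);

    float* dev = nullptr;
    if (cudaMalloc((void**)&dev, bytes) != cudaSuccess)
        return -1;

    const int threads = 256;
    const int blocks = (kRows * kCols + threads - 1) / threads;  // round up
    fillFlat<<<blocks, threads>>>(dev, kRows, kCols);

    // A plain cudaMemcpy is enough because the rows are contiguous.
    cudaError_t err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess)
        return -1;

    int mismatches = 0;
    for (int r = 0; r < kRows; ++r)
        for (int c = 0; c < kCols; ++c)
            if (host[r * kCols + c] != 100.0f * r + c)
                ++mismatches;
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", flatMismatches());
    return 0;
}
```

The trade-off against cudaMallocPitch() is that a flat array has no row padding, so indexing is simpler but rows are not guaranteed to start on an aligned address.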

Rom


score 2 · Answer 3 · edited Oct 11 '12 at 19:06

Your device code could be faster. Make better use of the threads: instead of having every thread loop over all rows, let each thread handle one row.

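__global__ void kernel(float* devPtr, size_t pitch)
{
    // One row per thread. Note that r is not checked against height.
    int r = threadIdx.x;

    // Rows are pitch bytes apart, so offset through a char* before casting back.
    float* row = (float*)((char*)devPtr + r * pitch);
    for (int c = 0; c < width; ++c) {
        float element = row[c];
    }
}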

Then choose the number of blocks and threads so that each thread handles a single row or, with a 2D launch, a single element.
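As one possible way to take this a step further, the sketch below uses a 2D grid so that each thread handles exactly one element, and it checks both the row and the column index against the array bounds. The kernel writes a value instead of only reading one so the result can be verified on the host. The names fillPitched, pitchedMismatches, kRows and kCols are assumptions made for this example, not part of the answer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same 50x50 shape as the example in the accepted answer.
constexpr int kRows = 50;
constexpr int kCols = 50;

// One thread per element of a pitched 2D array.
__global__ void fillPitched(float* devPtr, size_t pitch, int rows, int cols)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r >= rows || c >= cols) return;  // skip threads outside the array

    // Rows are pitch bytes apart, so offset through a char* before casting back.
    float* row = (float*)((char*)devPtr + r * pitch);
    row[c] = 100.0f * r + c;
}

// Returns the number of mismatching elements, or -1 on a CUDA error.
int pitchedMismatches()
{
    static float host[kRows][kCols];
    const size_t rowBytes = kCols * sizeof(float);

    float* devPtr = nullptr;
    size_t pitch = 0;
    if (cudaMallocPitch((void**)&devPtr, &pitch, rowBytes, kRows) != cudaSuccess)
        return -1;

    dim3 block(16, 16);
    dim3 grid((kCols + block.x - 1) / block.x,   // round up so the grid
              (kRows + block.y - 1) / block.y);  // covers every element
    fillPitched<<<grid, block>>>(devPtr, pitch, kRows, kCols);

    cudaError_t err = cudaMemcpy2D(host, rowBytes, devPtr, pitch,
                                   rowBytes, kRows, cudaMemcpyDeviceToHost);
    cudaFree(devPtr);
    if (err != cudaSuccess)
        return -1;

    int mismatches = 0;
    for (int r = 0; r < kRows; ++r)
        for (int c = 0; c < kCols; ++c)
            if (host[r][c] != 100.0f * r + c)
                ++mismatches;
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", pitchedMismatches());
    return 0;
}
```

Mapping threadIdx.x to the column keeps neighbouring threads on neighbouring addresses within a row, which is generally the access pattern that pitched allocations are meant to help with.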

edited Oct 11 '12 at 19:06 by Brian Mains
answered Jul 31 '12 at 05:56 by Abdullah

The code Gitmo posted is a useless sample from the docs. Yes, your version is faster, but how do you do this in parallel for rows and columns? Strictly speaking, you could have a mess on your hands because you don't check whether `r` is less than the actual number of rows. – darda Jun 19 '14 at 00:25
