6

From looking at the CUDA 5.5 API Reference and the CUDA C Programming Guide, it seems that there is no cudaCalloc(), an on-GPU equivalent of the standard C library's calloc().

  • Is there really no API functionality for allocating a buffer initialized to all-zeros?
  • Is there something better I can do than call cudaMalloc() and then cudaMemset()?
einpoklum
  • A compiler library probably implements calloc as a wrapper around malloc + memset anyhow. – Lundin Jan 20 '14 at 12:21
  • @Lundin: I would say probably not. You can allocate zeroed memory blocks without actually setting any data, just by marking the blocks empty (e.g. not mapped from virtual to physical memory, so that writing to them causes a page fault). Of course this depends on what machine you're on. I'm not sure how memory zeroing happens on GPUs, but it may well be the case that you can do better than malloc+memset. – einpoklum Jan 20 '14 at 12:52
  • Use `thrust::device_vector`. – Jared Hoberock Jan 21 '14 at 01:43
  • @JaredHoberock: Can you make that an answer and explain why using `thrust::device_vector` is a good idea in this context? – einpoklum Jan 21 '14 at 09:36
  • Does anyone among the answers below satisfy you? If yes, please accept it. You have the bad habit of throwing the stone and then hiding your hand. – Vitality Jan 30 '14 at 13:35
  • @JackOLantern: I don't really like the macro in RobertCrovella's answer, but the answer is basically 'No', and those are the API calls for the workaround, so... – einpoklum Jan 30 '14 at 15:30

4 Answers

12

Is there really no API functionality for allocating a buffer initialized to all-zeros?

There really is not.

Is there something better I can do than cudaMalloc() followed by cudaMemset()?

You could use a macro, if it's a matter of convenience (you haven't told us what you mean by better, if the answer to the first question is no):

#define cudaCalloc(A, B, C) \
    do { \
        cudaError_t __cudaCalloc_err = cudaMalloc(A, (B)*(C)); \
        if (__cudaCalloc_err == cudaSuccess) cudaMemset(*(A), 0, (B)*(C)); \
    } while (0)

The above macro will work with the kind of error checking I usually do (which is based on using cudaGetLastError()); or you can build your preferred error checking directly into the macro, if you like. See this question about error handling.
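For illustration, here is a hedged usage sketch, not part of the original answer: the buffer name d_buf and the size of 1024 floats are illustrative, and it assumes the cudaCalloc macro above is visible in the same file. The error check is the cudaGetLastError()-based style the answer alludes to.

#include <cstdio>
#include <cuda_runtime.h>

// Assumes the cudaCalloc macro defined above is in scope.
int main()
{
    float *d_buf = nullptr;
    cudaCalloc(&d_buf, 1024, sizeof(float));   // allocate and zero 1024 floats

    // Error check in the cudaGetLastError() style the answer mentions.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaCalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d_buf);
    return 0;
}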

Robert Crovella
  • Revisiting your answer - why a macro rather than a function marked `inline`? – einpoklum May 22 '17 at 10:41
  • You could do that instead. – Robert Crovella May 22 '17 at 13:58
  • I was thinking perhaps, in our more civilized age (well, for programming anyway) it would be better to edit your answer to recommend that instead. Macros should not be encouraged unless absolutely necessary IMO. By the way, Nikolay Sakharnykh says hi. Or rather, I mentioned your name and he acknowledged it :-) – einpoklum May 22 '17 at 17:41
  • why don't you add an answer? Then you'll get the credit for being more civilized. I would upvote it. You can even un-accept this one and accept your own. – Robert Crovella May 22 '17 at 17:56
1

If all you want is a simple way to zero out new allocations, you can use thrust::device_vector, which default constructs its elements. For primitive types, this is the same behavior as calloc.
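For concreteness, a minimal sketch of this suggestion, not part of the original answer; the element type int and the size of 1024 are illustrative:

#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>

int main()
{
    const int n = 1024;                         // illustrative size
    thrust::device_vector<int> vec(n);          // elements default-constructed to zero

    // A raw device pointer can be obtained if a kernel or runtime-API call needs one.
    int *raw = thrust::raw_pointer_cast(vec.data());
    (void)raw;                                  // silence unused-variable warnings
    return 0;
}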

Jared Hoberock
0

There is no calloc()-like functionality in the CUDA Runtime API, nor any other lower-level equivalent. Instead, you can do the following:

cudaMalloc(&ptr, size);
cudaMemset(ptr, 0, size);

Note that this is all synchronous. There's a cudaMemsetAsync() as well, although, frankly, cudaMalloc() calls are currently slow enough that it doesn't really matter.
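A hedged sketch of the asynchronous variant mentioned above, not from the original answer; the stream, buffer name, and size are illustrative:

#include <cstddef>
#include <cuda_runtime.h>

int main()
{
    const size_t size = 1024 * sizeof(float);   // illustrative size
    float *ptr = nullptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMalloc(&ptr, size);                     // the allocation itself is synchronous
    cudaMemsetAsync(ptr, 0, size, stream);      // zeroing is enqueued on the stream

    cudaStreamSynchronize(stream);              // wait before the buffer is used or freed
    cudaFree(ptr);
    cudaStreamDestroy(stream);
    return 0;
}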

einpoklum
TripleS
  • Using Memcpy to zero a buffer is a rather bad idea, I think. – einpoklum Jan 20 '14 at 14:27
  • `cudaMemset()` runs asynchronously with the host anyway (see the [ref manual](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge07c97b96efd09abaeb3ca3b5f8da4ee)). – Tom Jan 20 '14 at 23:05
  • Not true, the function exhibits synchronous behavior in most cases. – TripleS Jan 21 '14 at 05:06
0

Here is a solution with an inline function. devPtr is expected to be a pointer to a pointer of any type. Using a void* as the function argument spares the caller from having to apply a cast.

inline cudaError_t
_cuda_calloc( void *devPtr, size_t size )
{
  // devPtr actually points to a device pointer; allocate the buffer first.
  cudaError_t err = cudaMalloc( (void**)devPtr, size );
  // Zero the buffer only if the allocation succeeded.
  if( err == cudaSuccess ) err = cudaMemset( *(void**)devPtr, 0, size );
  return err;
}
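A brief usage sketch, not from the original answer: the buffer name d_data and the size of 256 floats are illustrative, and it assumes the _cuda_calloc definition above is in the same file. Because the parameter is a plain void*, the address of a typed device pointer can be passed without a cast.

#include <cuda_runtime.h>

// Assumes _cuda_calloc() from above is defined in this translation unit.
int main()
{
  float *d_data = nullptr;
  // No (void**) cast is needed at the call site.
  cudaError_t err = _cuda_calloc( &d_data, 256 * sizeof(float) );
  if( err != cudaSuccess ) return 1;
  cudaFree( d_data );
  return 0;
}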
Claas Bontus