6

From looking at the CUDA 5.5 API Reference and the CUDA C Programming Guide, it seems that there is no cudaCalloc(), an on-GPU equivalent of the standard C library's calloc().

  • Is there really no API functionality for allocating a buffer initialized to all-zeros?
  • Is there something better I can do than call cudaMalloc() and then cudaMemset()?
einpoklum
  • A compiler library probably implements calloc as a wrapper around malloc + memset anyhow. – Lundin Jan 20 '14 at 12:21
  • @Lundin: I would say probably not. You can allocate zeroed memory blocks without actually setting any data, just by marking the blocks empty (e.g. not mapped from virtual to physical memory, so that writing to them causes a page fault). Of course this depends on what machine you're on. I'm not sure how memory zeroing happens on GPUs, but it may well be the case that you can do better than malloc+memset. – einpoklum Jan 20 '14 at 12:52
  • Use `thrust::device_vector`. – Jared Hoberock Jan 21 '14 at 01:43
  • @JaredHoberock: Can you make that an answer and explain why using `thrust::device_vector` is a good idea in this context? – einpoklum Jan 21 '14 at 09:36
  • Does anyone among the answers below satisfy you? If yes, please accept it. You have the bad habit of throwing the stone and then hiding your hand. – Vitality Jan 30 '14 at 13:35
  • @JackOLantern: I don't really like the macro in RobertCrovella's answer, but the answer is basically 'No', and those are the API calls for the workaround, so... – einpoklum Jan 30 '14 at 15:30

4 Answers

12

Is there really no API functionality for allocating a buffer initialized to all-zeros?

There really is not.

Is there something better I can do than cudaMalloc() followed by cudaMemset()?

You could use a macro, if it's a matter of convenience (you haven't told us what you mean by better, if the answer to the first question is no):

#define cudaCalloc(A, B, C) \
    do { \
        cudaError_t __cudaCalloc_err = cudaMalloc(A, (B)*(C)); \
        if (__cudaCalloc_err == cudaSuccess) cudaMemset(*(A), 0, (B)*(C)); \
    } while (0)

The above macro will work with the kind of error checking I usually do (which is based on using cudaGetLastError()); or you can build your preferred error checking directly into the macro, if you like. See this question about error handling.
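For illustration, here is a hedged usage sketch, not part of the original answer: the buffer name d_buf and the size of 1024 floats are illustrative, and it assumes the cudaCalloc macro above is visible in the same file. The error check is the cudaGetLastError()-based style the answer alludes to.

#include <cstdio>
#include <cuda_runtime.h>

// Assumes the cudaCalloc macro defined above is in scope.
int main()
{
    float *d_buf = nullptr;
    cudaCalloc(&d_buf, 1024, sizeof(float));   // allocate and zero 1024 floats

    // Error check in the cudaGetLastError() style the answer mentions.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaCalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d_buf);
    return 0;
}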

Robert Crovella
  • Revisiting your answer - why a macro rather than a function marked `inline`? – einpoklum May 22 '17 at 10:41
  • You could do that instead. – Robert Crovella May 22 '17 at 13:58
  • I was thinking perhaps, in our more civilized age (well, for programming anyway) it would be better to edit your answer to recommend that instead. Macros should not be encouraged unless absolutely necessary IMO. By the way, Nikolay Sakharnykh says hi. Or rather, I mentioned your name and he acknowledged it :-) – einpoklum May 22 '17 at 17:41
  • why don't you add an answer? Then you'll get the credit for being more civilized. I would upvote it. You can even un-accept this one and accept your own. – Robert Crovella May 22 '17 at 17:56
1

If all you want is a simple way to zero out new allocations, you can use thrust::device_vector, which default constructs its elements. For primitive types, this is the same behavior as calloc.
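For concreteness, a minimal sketch of this suggestion, not part of the original answer; the element type int and the size of 1024 are illustrative:

#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>

int main()
{
    const int n = 1024;                         // illustrative size
    thrust::device_vector<int> vec(n);          // elements default-constructed to zero

    // A raw device pointer can be obtained if a kernel or runtime-API call needs one.
    int *raw = thrust::raw_pointer_cast(vec.data());
    (void)raw;                                  // silence unused-variable warnings
    return 0;
}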

Jared Hoberock
0

There is no calloc()-like functionality in the CUDA Runtime API, nor any other lower-level equivalent. Instead, you can do the following:

cudaMalloc(&ptr, size);
cudaMemset(ptr, 0, size);

Note that this is all synchronous. There's a cudaMemsetAsync() as well, although, frankly, cudaMalloc() calls are currently slow enough that it doesn't really matter.
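A hedged sketch of the asynchronous variant mentioned above, not from the original answer; the stream, buffer name, and size are illustrative:

#include <cstddef>
#include <cuda_runtime.h>

int main()
{
    const size_t size = 1024 * sizeof(float);   // illustrative size
    float *ptr = nullptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMalloc(&ptr, size);                     // the allocation itself is synchronous
    cudaMemsetAsync(ptr, 0, size, stream);      // zeroing is enqueued on the stream

    cudaStreamSynchronize(stream);              // wait before the buffer is used or freed
    cudaFree(ptr);
    cudaStreamDestroy(stream);
    return 0;
}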

einpoklum
TripleS
  • Using Memcpy to zero a buffer is a rather bad idea, I think. – einpoklum Jan 20 '14 at 14:27
  • `cudaMemset()` runs asynchronously with the host anyway (see the [ref manual](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge07c97b96efd09abaeb3ca3b5f8da4ee)). – Tom Jan 20 '14 at 23:05
  • Not true, the function exhibits synchronous behavior in most cases. – TripleS Jan 21 '14 at 05:06
0

Here is a solution with an inline function. devPtr is expected to be a pointer to a pointer of any type. Using a void* as the function argument spares the caller from having to apply a cast.

inline cudaError_t
_cuda_calloc( void *devPtr, size_t size )
{
  // devPtr actually points to a device pointer; allocate the buffer first.
  cudaError_t err = cudaMalloc( (void**)devPtr, size );
  // Zero the buffer only if the allocation succeeded.
  if( err == cudaSuccess ) err = cudaMemset( *(void**)devPtr, 0, size );
  return err;
}
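A brief usage sketch, not from the original answer: the buffer name d_data and the size of 256 floats are illustrative, and it assumes the _cuda_calloc definition above is in the same file. Because the parameter is a plain void*, the address of a typed device pointer can be passed without a cast.

#include <cuda_runtime.h>

// Assumes _cuda_calloc() from above is defined in this translation unit.
int main()
{
  float *d_data = nullptr;
  // No (void**) cast is needed at the call site.
  cudaError_t err = _cuda_calloc( &d_data, 256 * sizeof(float) );
  if( err != cudaSuccess ) return 1;
  cudaFree( d_data );
  return 0;
}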
Claas Bontus