
I need to measure the performance of a CUDA kernel running on an A100. The Best Practices Guide says that we can use either CUDA events or standard timing functions such as clock() on Linux.

Although these methods give accurate results when timing a whole kernel, I want finer-grained detail: how long individual operations inside the kernel take, so that I can find the bottleneck in my kernel code. Are there any tricks for timing just part of a kernel function?

talonmies
fff
    If you search the `cuda` tag for `clock64()` you will find other useful info about in-kernel timing, such as [this answer](https://stackoverflow.com/questions/60739210/instruction-execution-order-by-cuda-driver/60777298#60777298). Also, take note of the suggestion to use `globaltimer` [here](https://stackoverflow.com/questions/43008430/how-to-convert-cuda-clock-cycles-to-milliseconds) – Robert Crovella Dec 29 '21 at 05:00
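As a rough illustration of the `clock64()` approach mentioned in the comment above, here is a minimal sketch that timestamps two sections of a kernel and reports the cycle counts for block 0. The kernel `timed_kernel`, the struct `SectionCycles`, the helper `measure_first_block`, and the two "sections" themselves are all hypothetical and made up for this example; they are not taken from the question. The counts are in SM clock cycles, not seconds, and the compiler is free to move instructions across the `clock64()` reads, so treat the numbers as approximate and check the generated SASS if they look suspicious.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical record: cycle counts for two sections of one kernel.
struct SectionCycles {
    long long section_a;  // cycles between timestamps t0 and t1
    long long section_b;  // cycles between timestamps t1 and t2
    long long total;      // cycles between timestamps t0 and t2
};

// Illustrative kernel: thread 0 of each block stores its own timings.
__global__ void timed_kernel(const float* in, float* out, int n,
                             SectionCycles* cycles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    long long t0 = clock64();

    // Section A (made up): a short arithmetic loop.
    float x = (i < n) ? in[i] : 0.0f;
    for (int k = 0; k < 64; ++k) {
        x = x * 1.0001f + 0.5f;
    }
    __syncthreads();  // wait until the whole block has finished section A
    long long t1 = clock64();

    // Section B (made up): a global memory store.
    if (i < n) {
        out[i] = x;
    }
    __syncthreads();  // wait until the whole block has finished section B
    long long t2 = clock64();

    if (threadIdx.x == 0) {
        cycles[blockIdx.x].section_a = t1 - t0;
        cycles[blockIdx.x].section_b = t2 - t1;
        cycles[blockIdx.x].total     = t2 - t0;
    }
}

// Hypothetical host helper: launches the kernel and returns block 0's timings.
// Error checking is omitted to keep the sketch short.
SectionCycles measure_first_block(int n)
{
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;

    float* d_in = nullptr;
    float* d_out = nullptr;
    SectionCycles* d_cycles = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_cycles, blocks * sizeof(SectionCycles));
    cudaMemset(d_in, 0, n * sizeof(float));

    timed_kernel<<<blocks, threads>>>(d_in, d_out, n, d_cycles);

    SectionCycles h = {0, 0, 0};
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h, d_cycles, sizeof(SectionCycles), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_cycles);
    return h;
}

int main()
{
    SectionCycles c = measure_first_block(1 << 16);
    printf("section A: %lld cycles\n", c.section_a);
    printf("section B: %lld cycles\n", c.section_b);
    printf("total:     %lld cycles\n", c.total);
    return 0;
}
```

Compile with something like `nvcc -O2 timing.cu -o timing`. Only one thread per block writes the result because `clock64()` reads a per-SM counter, so values from different blocks are not directly comparable with one another; differences taken within a single thread are the meaningful quantity.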

0 Answers