I need to measure the performance of a CUDA kernel running on an A100. The Best Practices Guide says we can use either CUDA events or standard host timing functions such as clock() on Linux.
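This is roughly how I time the whole kernel today with CUDA events (a minimal sketch; `myKernel` and its launch configuration stand in for my actual kernel):

```cpp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
myKernel<<<grid, block>>>(/* args */);   // placeholder for my real kernel launch
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // wait until the kernel has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```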
Although these methods give accurate results when timing a whole kernel launch, I want a more detailed breakdown of how long the individual operations inside my kernel take, so I can find the bottleneck in my kernel code. Are there any tricks for timing just part of a kernel function?
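For example, I have considered reading clock64() inside the kernel around the region I care about and writing the cycle counts back to global memory, roughly like the sketch below (the `elapsed` buffer and the region being timed are made up for illustration). I am not sure how reliable this is given warp scheduling, so I would appreciate guidance on whether this is a sensible approach or if there is something better.

```cpp
__global__ void myKernel(/* args */, long long *elapsed)
{
    // ... earlier part of the kernel ...

    long long t0 = clock64();   // per-SM cycle counter before the region

    // region of the kernel I want to time
    // ... code under investigation ...

    long long t1 = clock64();   // cycle counter after the region

    // have one thread per block report the cycle count for that block
    if (threadIdx.x == 0)
        elapsed[blockIdx.x] = t1 - t0;
}
```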