I need to measure the performance of a CUDA kernel running on an A100. The Best Practices Guide says we can use either CUDA events or standard host timing functions such as clock() on Linux.
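This is roughly how I time the whole kernel today with CUDA events (a minimal sketch; `myKernel` and its launch configuration stand in for my actual kernel):

```cpp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
myKernel<<<grid, block>>>(/* args */);   // placeholder for my real kernel launch
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // wait until the kernel has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```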
Although these methods give accurate results when timing a whole kernel launch, I want a more detailed breakdown of how long the individual operations inside my kernel take, so I can find the bottleneck in my kernel code. Are there any tricks for timing just part of a kernel function?
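For example, I have considered reading clock64() inside the kernel around the region I care about and writing the cycle counts back to global memory, roughly like the sketch below (the `elapsed` buffer and the region being timed are made up for illustration). I am not sure how reliable this is given warp scheduling, so I would appreciate guidance on whether this is a sensible approach or if there is something better.

```cpp
__global__ void myKernel(/* args */, long long *elapsed)
{
    // ... earlier part of the kernel ...

    long long t0 = clock64();   // per-SM cycle counter before the region

    // region of the kernel I want to time
    // ... code under investigation ...

    long long t1 = clock64();   // cycle counter after the region

    // have one thread per block report the cycle count for that block
    if (threadIdx.x == 0)
        elapsed[blockIdx.x] = t1 - t0;
}
```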