
This question is a follow-up to "Fortran: Best way to time sections of your code?".

If I want to time functions in my code, I know I could use gprof or kcachegrind. I also know that the results from these tools can be skewed (see http://www.yosefk.com/blog/how-profilers-lie-the-cases-of-gprof-and-kcachegrind.html and https://stackoverflow.com/questions/1777556/alternatives-to-gprof/1779343#1779343).

I know I could add manual timers to each function for which I want data, but that quickly becomes tedious, and it is impractical for libraries if I want data for everything.
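For concreteness, here is a minimal sketch of the manual-timer approach in standard Fortran, using system_clock (do_work is a hypothetical stand-in for whatever section is being measured):

    ! Minimal sketch of a manual timer built on system_clock.
    ! do_work is a hypothetical stand-in for the section being measured.
    program manual_timer
      implicit none
      integer :: t_start, t_end, rate
      real :: elapsed

      call system_clock(count_rate=rate)   ! clock ticks per second
      call system_clock(t_start)
      call do_work()
      call system_clock(t_end)

      elapsed = real(t_end - t_start) / real(rate)
      print '(a, f8.3, a)', 'do_work took ', elapsed, ' seconds'

    contains

      subroutine do_work()
        integer :: i
        real :: s
        s = 0.0
        do i = 1, 10000000
          s = s + sqrt(real(i))
        end do
        print *, 'checksum =', s   ! a use of s, so the loop is not optimized away
      end subroutine do_work

    end program manual_timer

Multiplying this boilerplate across every routine in a large code, or in a third-party library I don't control, is where the approach stops scaling.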

Unfortunately, I deal with communities that want this timing data as evidence when arguing for the performance of their methods (to demonstrate performance improvements, to point out spots where performance is bad, for scientific papers, and so on). This seems to be popular with management types and some academic types. Is there a better way to get reliably accurate timing data than inserting timers? Should I be using a combination of imperfect tools and sifting through the performance data in some way?

(Note: This question isn't about performance tuning, even though it's related. You can do performance tuning without timing things by using random pausing. It also isn't about whether or not timing is worthwhile, because these communities want timing data, and I don't have the power to change their minds easily. Any comments about these topics make for great discussion, but they're not helpful in answering my question, because the reality is that the people I answer to want timing data that somehow reflects performance.)

Geoff Oxberry
  • Hi Geoff. You can get fairly accurate time fractions if you can get a large number of stack samples, as you might with oprofile. The fraction of samples $f$ in which your function appears gives its inclusive fraction. With $n$ samples, the standard error of that estimate is $\sqrt{f(1-f)/n}$, so for a 1% relative standard error you need around $10^4$ to $10^5$ samples. – Mike Dunlavey Oct 17 '16 at 15:14
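(As a worked example of that formula: if a routine appears in $f = 0.1$ of $n = 10^4$ samples, the standard error is $\sqrt{0.1 \times 0.9 / 10^4} = 0.003$, or 3% of the estimate itself; driving the relative error down to 1% for that routine takes $n = 9 \times 10^4$ samples, which is where the $10^4$ to $10^5$ range comes from.)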

3 Answers


You might consider a stack-sampling profiler like HPCToolkit or VTune, or the system profiler for Linux, perf.
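A typical perf workflow, for reference, is perf record ./a.out followed by perf report; compiling with -g lets the report attribute samples to source lines rather than just symbols.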

Also, I don't see what's objectionable about wanting to know how long things take. If you want to demonstrate that your implementation of an algorithm has the asymptotic performance you derived, actually measuring the running time is the best way to do so.
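As a minimal sketch of that kind of measurement (kernel is a hypothetical $O(n^2)$ stand-in, and in practice you would repeat runs to average out noise), you can time the routine at doubling problem sizes and check that the ratio of successive times approaches the expected factor of 4:

    ! Minimal sketch: check asymptotic scaling by direct timing.
    ! kernel is a hypothetical O(n**2) routine, so doubling n should
    ! drive the ratio of successive times toward 4.
    program scaling_check
      implicit none
      integer :: n, t0, t1, rate
      real :: elapsed, prev

      call system_clock(count_rate=rate)
      prev = -1.0
      n = 1000
      do while (n <= 16000)
        call system_clock(t0)
        call kernel(n)
        call system_clock(t1)
        elapsed = real(t1 - t0) / real(rate)
        if (prev > 0.0) then
          print '(a, i6, a, f8.3, a, f6.2)', 'n = ', n, '  t = ', &
                elapsed, ' s  ratio = ', elapsed / prev
        else
          print '(a, i6, a, f8.3, a)', 'n = ', n, '  t = ', elapsed, ' s'
        end if
        prev = elapsed
        n = 2 * n
      end do

    contains

      subroutine kernel(n)
        integer, intent(in) :: n
        integer :: i, j
        real :: s
        s = 0.0
        do i = 1, n
          do j = 1, n
            s = s + real(i) * real(j)
          end do
        end do
        if (s < 0.0) print *, s   ! a use of s, to prevent dead-code elimination
      end subroutine kernel

    end program scaling_check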

Bill Barth
  • I don't either, but the random pausing crowd likes to object to timing in favor of stack samples. – Geoff Oxberry Apr 15 '13 at 00:33
  • Isn't random pausing proto-stack-sampling? – Bill Barth Apr 15 '13 at 01:00
  • Never mind, I misread your comment. A better point would be that they may be objecting to timing because of the known issues with timing profilers. Explicit timing around routines of interest can't be objected to on these grounds, though you may be looking in the wrong place without good sampling. Depends on your purpose, I guess. – Bill Barth Apr 15 '13 at 01:11
  • @Geoff: Stack samples tell you percentages of total time, where total time itself is trivial to measure. If you want statistical precision of those percentages, you need lots of samples. Then if you want average time per call, you also need invocation counts of the suspect routines. That's how I would proceed (a worked example of this arithmetic follows these comments). – Mike Dunlavey Apr 19 '13 at 01:12
  • @Geoff: I forgot to mention Zoom. The bad part is it is not free software, and I've been admonished not to recommend it for that reason :) The good part is it may do what you need. – Mike Dunlavey Apr 19 '13 at 01:21
  • @BillBarth: Here's the long explanation. The short explanation is that it depends on your objective: measuring the code versus speeding it up. When the goal is to speed up the code, applying full attention to a small number of maximally informative samples finds speedup opportunities that are not found simply by making high-precision measurements. – Mike Dunlavey Apr 22 '13 at 15:27
  • @BillBarth: "Isn't random pausing proto-stack-sampling?" There's an observer-bias issue. Even good profilers tend to say there is no way to speed up the code when there actually is. So if that's what someone wants to hear, they will like it. On the other hand, if somebody really needs to squeeze cycles, each stack sample is rich in information about why a moment is spent, and seeing a nugget on two samples nails it, while mushing a large number of samples into statistics loses that explanatory information. It's quality vs. quantity. – Mike Dunlavey Aug 22 '13 at 15:19
  • @MikeDunlavey: If you have a good stack-sampling profiler, then you have all the data disaggregated and can look at it in that form and ignore the summaries. My experience with your random-pausing method was that it led me down a rabbit hole of a routine that wasn't all that important (less than 5% of time spent) because I got unlucky with a couple of pauses. Why not let the profiler do the pausing and then flip through its logs? – Bill Barth Aug 22 '13 at 16:50
  • @BillBarth: Can you flip through its logs, and see the actual raw samples (with line numbers preferably)? I've done that with the R profiler, and though it didn't have line numbers, it was still useful for the issues I had. I am curious about your experience. Usually 10-20 samples is unambiguous. It's a distribution - a low percent is possible, and so is a high percent. I have heard of people doing things like a) taking samples while it waits for user input, or b) not really looking at the stack, just the PC. Maybe it takes a bit of practice. – Mike Dunlavey Aug 22 '13 at 17:24
  • @MikeDunlavey: Off the top of my head, I don't know. I'd have to look into the data formats of HPCToolkit and VTune in order to figure it out. Both collect massive data files in my experience. – Bill Barth Aug 22 '13 at 19:04
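As a worked example of the average-time-per-call arithmetic referenced in the comments above: if a routine is on the stack in 30% of samples during a 50-second run and the invocation counters say it was called 1,000 times, its average inclusive cost is $0.3 \times 50\ \mathrm{s} / 1000 = 15$ ms per call.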