I am writing a program and would like to know how much faster one version runs compared to another.
To that end, I wrote a benchmark script that times a run of my program, repeats the measurement multiple times, and computes the average of the results.
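For reference, here is a minimal sketch of what my script does (in Python; the `./my_program` path is just a placeholder for the actual invocation):

```python
import statistics
import subprocess
import time

def benchmark(cmd, repeats=30):
    """Run `cmd` repeatedly and return the list of individual wall-clock times."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        times.append(time.perf_counter() - start)
    return times

# Placeholder invocation; in practice this is the program under test.
timings = benchmark(["./my_program"])
print(statistics.mean(timings))
```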
Say I run it on version 1 of my program and get the value X1; then I run it on version 2 and get the value X2.
How can I tell whether the difference between the two results is significant? In addition, how can I objectively determine how many times I need to run the benchmark to get useful results?
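For the first question, I wondered whether a two-sample test on the raw per-run timings (rather than just the averages) might be the right tool, e.g. Welch's t-test via SciPy, but I am not sure it applies to timing data:

```python
from scipy import stats

# times_v1, times_v2: raw per-run timings collected by the script above
t_stat, p_value = stats.ttest_ind(times_v1, times_v2, equal_var=False)  # Welch's t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # a small p-value would suggest a real difference
```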