In the field of cryptography, one is concerned with whether algorithms execute in so-called "constant time" (more precisely, in time independent of the input data, or at least of the parts of the data that are secret; variability in execution time correlated with public data is acceptable). Otherwise, execution time may leak information that an attacker can use to mount a timing side-channel attack and extract, e.g., the secret key.
The following three representative histograms (out of a larger ensemble of different parameters and implementations) show performance measurements (CPU cycle counts) for each of 1 million executions of a given algorithm. For each execution, a random bit is sampled to decide whether the input data for that execution will be all zeros (darker shade) or random (lighter shade). While sometimes difficult to spot, each histogram actually overlays two highly overlapping data sets.
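For concreteness, here is a minimal sketch of how the labeled measurements could be split and overlaid; the file names and the array layout are assumptions for illustration, not a description of my actual harness:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data layout: `cycles` holds the 1 million cycle counts and
# `is_zero_input` holds the random bit sampled for each execution
# (True = all-zeros input, False = random input).
cycles = np.loadtxt("cycles.txt", dtype=np.int64)
is_zero_input = np.loadtxt("labels.txt", dtype=np.int64) == 1

bins = np.arange(cycles.min(), cycles.max() + 2)  # one bin per cycle count
plt.hist(cycles[is_zero_input], bins=bins, alpha=0.8, label="all-zeros input")
plt.hist(cycles[~is_zero_input], bins=bins, alpha=0.5, label="random input")
plt.xlabel("CPU cycles")
plt.ylabel("number of executions")
plt.legend()
plt.show()
```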
Some observations about the data:
- Some histograms (such as the first one) appear to be Gaussian, while others are bimodal and as such, decidedly not Gaussian.
- The data is discrete, as CPU clock cycles are discrete events.
- Cycle counts are always clustered in a small range, say on the order of dozens to low hundreds of clock cycles. Given that there are a million measurements, repeated values (ties) are inevitable.
- Due to the randomization, there isn't a perfect 500,000 : 500,000 split of measurements between the all-zeros and random inputs, but rather a small imbalance (on the order of a few hundred measurements) in favor of one class or the other. (A quick numerical check of these observations is sketched below.)
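This assumes the same hypothetical file layout as the plotting sketch above:

```python
import numpy as np

# Hypothetical file names, as in the plotting sketch above.
cycles = np.loadtxt("cycles.txt", dtype=np.int64)
is_zero_input = np.loadtxt("labels.txt", dtype=np.int64) == 1

print("class split:", is_zero_input.sum(), "vs", (~is_zero_input).sum())
print("range of cycle counts:", cycles.min(), "to", cycles.max())
print("distinct values:", np.unique(cycles).size, "out of", cycles.size, "measurements")
```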
Given previous knowledge about our code and the hardware it runs on, I would expect there to be no difference in execution time regardless of the input data (that is, the code actually is constant time). The histograms appear to confirm this, at least to the naked eye. However, I would like to run a statistical hypothesis test to formalize this, and I am having difficulty finding an appropriate test.
Point 1 above rules out parametric tests (such as Student's t-test).
Point 2 rules out tests that assume continuous distributions, such as a Kolmogorov-Smirnov test.
Point 3 rules out tests that assume no ties (repeated values) in the data, such as Wilcoxon's signed-rank test.
Also, I believe the experimental design is such that the samples are unpaired. Perhaps if I interleaved zero/random/zero/random/zero/... inputs I could claim that each zero-input measurement is paired with the subsequent random-input measurement, but given the randomization, the subsequent input has a 50% chance of being either a zero or a random input.
I am looking for suggestions on a suitable hypothesis test for my experimental design. As a bonus, I would love to know whether my choice of 1 million samples is insufficient, adequate, or excessive for this experiment, given the characteristics of the data implied by the histograms above. I have no idea how to calculate statistical power.
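The only way I can imagine approaching power is by simulation, roughly along these lines (everything concrete here, i.e. the file name, the injected 1-cycle shift, and the Mann-Whitney U test used as a cheap stand-in, is an arbitrary choice for illustration): resample the measured cycle counts, artificially slow down one class by a known amount, and count how often the test rejects.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical file: all measured cycle counts pooled together (the null model).
baseline = np.loadtxt("cycles.txt", dtype=np.int64)

def estimated_power(shift_cycles, n_per_class=500_000, n_sims=200, alpha=0.01):
    """Fraction of simulations in which the test rejects when one class is
    artificially slowed down by `shift_cycles` cycles."""
    rejections = 0
    for _ in range(n_sims):
        zeros = rng.choice(baseline, size=n_per_class, replace=True)
        randoms = rng.choice(baseline, size=n_per_class, replace=True) + shift_cycles
        # Mann-Whitney U is used purely as a cheap stand-in test here;
        # substitute whichever test is finally chosen.
        p = stats.mannwhitneyu(zeros, randoms, alternative="two-sided").pvalue
        rejections += p < alpha
    return rejections / n_sims

# e.g. estimated_power(1) estimates how often a systematic 1-cycle difference
# would be detected with 500,000 samples per class.
print(estimated_power(1))
```

I have no idea whether this kind of simulation is the standard way to reason about power for such a setup.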
A possible solution
After much searching, I found two possible options: a modification of the Kolmogorov-Smirnov test for discrete distributions, as implemented by the R KSgeneral package, and the $k$-sample Anderson-Darling test, as implemented in SciPy.
The KS test is rather slow for my large sample sizes (although it would not be a problem to let it run for a day or so to collect all the values if necessary), and I have seen an effect where, if I restrict the sample size (still to a fairly large 50,000 samples per class), I do not reject the null hypothesis, whereas for larger sample sizes the test becomes more likely to reject, similar to what is observed in this CrossValidated question.
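To see how strongly the verdict depends on the sample size, I can run a subsampling probe roughly like the following (a sketch with hypothetical file names, using SciPy's $k$-sample Anderson-Darling test as a stand-in, since KSgeneral is an R package):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical files: cycle counts for each input class.
cycles_zero = np.loadtxt("cycles_zero.txt", dtype=np.int64)
cycles_random = np.loadtxt("cycles_random.txt", dtype=np.int64)

# Run the test on increasingly large random subsamples of each class.
for n in (10_000, 50_000, 200_000):
    sub_zero = rng.choice(cycles_zero, size=n, replace=False)
    sub_rand = rng.choice(cycles_random, size=n, replace=False)
    res = stats.anderson_ksamp([sub_zero, sub_rand])
    print(f"n={n}: statistic={res.statistic:.3f}, "
          f"significance_level={res.significance_level:.3f}")
```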
As for the Anderson-Darling $k$-sample test, quoting from the SciPy documentation:
It tests the null hypothesis that k-samples are drawn from the same population without having to specify the distribution function of that population.
Also:
midrank : bool, optional
Type of Anderson-Darling test which is computed. Default (True) is the midrank test applicable to continuous and discrete populations. If False, the right side empirical distribution is used.
Finally, there is the following note:
[1] defines three versions of the k-sample Anderson-Darling test: one for continuous distributions and two for discrete distributions, in which ties between samples may occur. The default of this routine is to compute the version based on the midrank empirical distribution function. This test is applicable to continuous and discrete data. If midrank is set to False, the right side empirical distribution is used for a test for discrete data. According to [1], the two discrete test statistics differ only slightly if a few collisions due to round-off errors occur in the test not adjusted for ties between samples.
[1] Scholz, F. W. and Stephens, M. A. (1987), K-Sample Anderson-Darling Tests, Journal of the American Statistical Association, Vol. 82, pp. 918-924.
After running the test on all of my histograms (more precisely, on the data sets used to generate each histogram), using the full sample set in each case, I get large $p$-values for all data sets -- many of them capped at 0.25 due to a limitation of the test or of its SciPy implementation. Only one of the bimodal data sets resulted in $p = 0.00235$, although its histogram does not appear structurally different from the others. I guess that with a "large" ensemble of data sets (26 histograms in total) it is not completely unexpected for some to reject the null hypothesis purely by chance (obligatory XKCD reference).
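For reference, here is a minimal sketch of such a call on a single pair of data sets (the file names are hypothetical); SciPy caps the reported significance level at 0.25 and floors it at 0.001, which is the limitation referred to above:

```python
import numpy as np
from scipy import stats

# Hypothetical files: cycle counts for the all-zeros and random input classes.
cycles_zero = np.loadtxt("cycles_zero.txt", dtype=np.int64)
cycles_random = np.loadtxt("cycles_random.txt", dtype=np.int64)

# midrank=True (the default) selects the variant that handles ties,
# which the discrete cycle counts inevitably produce.
res = stats.anderson_ksamp([cycles_zero, cycles_random], midrank=True)
print("statistic:", res.statistic)
print("critical values:", res.critical_values)
# The returned significance level is floored at 0.001 and capped at 0.25.
print("approximate p-value:", res.significance_level)
```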
I know this sounds suspiciously like $p$-hacking, trying out all possible tests until one returns what I expect, but as a matter of fact even Student's t-test already returns what I expect -- I just feel I am violating too many of its assumptions for it to be a valid result.
To conclude, a more specific question: is the $k$-sample Anderson-Darling test suitable for my experiment?
Note: a previous version of the question included information on an attempt using the chi-square test. I deleted it since the question was getting too long. Feel free to look at the edit history if necessary.


