
In the field of cryptography, one is concerned that algorithms execute in so-called "constant time" (more precisely, in time independent of the input data, or at least of the parts of the data that are secret; variability in execution time correlated with public data is acceptable). Otherwise, execution time may leak information that an attacker can use to mount a timing side-channel attack and extract, e.g., the secret key.
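
To make the notion concrete, here is a toy Python illustration of the difference (this is not the code under test, only an example of the concept; `leaky_equal` and `constant_time_equal` are made-up names):

```python
import hmac

def leaky_equal(a: bytes, b: bytes) -> bool:
    # Returns as soon as a mismatch is found, so the running time depends on
    # how many leading bytes match -- a classic timing leak.
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True

def constant_time_equal(a: bytes, b: bytes) -> bool:
    # hmac.compare_digest is designed so that the comparison time does not
    # depend on where (or whether) the two inputs differ.
    return hmac.compare_digest(a, b)
```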

The following three representative histograms (out of a larger ensemble of different parameters and implementations) show performance measurements (CPU cycle counts) for each of 1 million executions of a given algorithm. For each execution, a random bit is sampled to decide whether the input data for that execution will be all zeros (darker shade) or random (lighter shade). While sometimes difficult to spot, each histogram therefore contains two highly overlapping data sets.
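
For concreteness, the collection loop is conceptually along these lines (a sketch only: `run_algorithm`, `INPUT_LEN` and the use of `perf_counter_ns` are placeholders for the actual implementation and the CPU cycle counter used):

```python
import secrets
import time

INPUT_LEN = 32           # assumed input size in bytes (placeholder)
N_EXECUTIONS = 1_000_000

def run_algorithm(data: bytes) -> None:
    """Placeholder for the implementation under test."""
    raise NotImplementedError

def collect_measurements(n: int = N_EXECUTIONS):
    labels, timings = [], []
    for _ in range(n):
        bit = secrets.randbits(1)  # per-execution coin flip: 0 = all zeros, 1 = random input
        data = bytes(INPUT_LEN) if bit == 0 else secrets.token_bytes(INPUT_LEN)
        start = time.perf_counter_ns()
        run_algorithm(data)
        timings.append(time.perf_counter_ns() - start)  # stand-in for a CPU cycle counter
        labels.append(bit)
    return labels, timings
```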

[Histogram 1: Gaussian]

[Histogram 2: bimodal with overlap]

[Histogram 3: bimodal without overlap]

Some observations about the data:

  1. Some histograms (such as the first one) appear to be Gaussian, while others are bimodal and as such, decidedly not Gaussian.
  2. The data is discrete, as CPU clock cycles are discrete events.
  3. Cycle counts are always clustered in a small range, say on the order of dozens to low hundreds of clock cycles. Since there are a million measurements, repeated values (ties) are inevitable.
  4. Due to the randomization, there isn't a perfect 500,000 : 500,000 split of measurements between zero and random inputs, but rather a small fluctuation (on the order of a few hundred measurements) in favor of one or the other.

Given previous knowledge about our code and the hardware it runs on, I would expect there to be no difference in execution times regardless of input data (that is, the code is actually constant time). The histograms appear to confirm this, at least to the naked eye. However, I would like to run a statistical hypothesis test to formalize this, but I am having difficulties finding an appropriate test.

Point 1 above rules out parametric tests (such as Student's t-test).

Point 2 rules out tests that assume continuous distributions, such as a Kolmogorov-Smirnov test.

Point 3 rules out tests that assume no repetitions in the input data, such as Wilcoxon's signed-rank test.

Also, I believe the experiment design is such that samples are unpaired. Perhaps if I interleaved zero/random/zero/random/zero/... inputs then I could claim each zero-input measurement is paired to the subsequent random-input measurement, but given the randomization, the subsequent input has a 50% chance of being a zero or random input.

I am looking for suggestions on a suitable hypothesis test for my experimental design. As a bonus, I would love to know whether my choice of 1 million samples is insufficient, adequate, or excessive for this experiment, given the characteristics of the data implied by the histograms above. I have no idea how to calculate statistical power.
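
Lacking an analytical route, the closest I can imagine for a power calculation is a brute-force simulation along these lines, though I am not sure it is the right approach (the rounded-normal model of cycle counts and the `simulated_power` helper are purely placeholders, not my real data or code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulated_power(shift_cycles=1.0, n_per_class=500_000, n_trials=100, alpha=0.05):
    """Monte Carlo power estimate: fraction of trials in which a two-sample
    test rejects when one class is shifted by `shift_cycles` relative to the
    other. The rounded-normal model below is only a stand-in; a model fitted
    to (or resampled from) the real measurements should be substituted."""
    rejections = 0
    for _ in range(n_trials):
        a = np.round(rng.normal(loc=200.0, scale=5.0, size=n_per_class))
        b = np.round(rng.normal(loc=200.0 + shift_cycles, scale=5.0, size=n_per_class))
        # Substitute whichever two-sample test is finally chosen.
        p = stats.mannwhitneyu(a, b).pvalue
        if p < alpha:
            rejections += 1
    return rejections / n_trials

# Example: estimated power to detect a 1-cycle shift with 500k samples per class.
# print(simulated_power(shift_cycles=1.0))
```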

A possible solution

After much searching, I found two possible options: a modification of the Kolmogorov-Smirnov test for discrete distributions, as implemented in the R package KSgeneral, and the $k$-sample Anderson-Darling test, as implemented in SciPy.

The KS test is rather slow for my large sample sizes (although it wouldn't be a problem to let it run for a day or so to collect all the values if necessary). I have also observed an effect where, restricting the sample size (still to a fairly large 50,000 samples per class), I do not reject the null hypothesis, while for larger sample sizes the test is more likely to reject, similar to what is observed in this CrossValidated question.

As for the Anderson-Darling $k$-sample test, quoting from the SciPy documentation:

It tests the null hypothesis that k-samples are drawn from the same population without having to specify the distribution function of that population.

Also:

midrank : bool, optional

Type of Anderson-Darling test which is computed. Default (True) is the midrank test applicable to continuous and discrete populations. If False, the right side empirical distribution is used.

Finally, there is the following note:

[1] defines three versions of the k-sample Anderson-Darling test: one for continuous distributions and two for discrete distributions, in which ties between samples may occur. The default of this routine is to compute the version based on the midrank empirical distribution function. This test is applicable to continuous and discrete data. If midrank is set to False, the right side empirical distribution is used for a test for discrete data. According to [1], the two discrete test statistics differ only slightly if a few collisions due to round-off errors occur in the test not adjusted for ties between samples.

[1] Scholz, F. W. and Stephens, M. A. (1987), K-Sample Anderson-Darling Tests, Journal of the American Statistical Association, Vol. 82, pp. 918-924.

After running the test on all of my histograms (more precisely, on the data sets used to generate each histogram), using the full sample set in each case, I get large $p$-values for all data sets -- many of which are capped at 0.25 due to a limitation of the test or of its SciPy implementation. Only one of the bimodal data sets resulted in $p = 0.00235$, although looking at the histogram it does not appear structurally different from the others. I guess that with a "large" ensemble of data sets (26 histograms in total) it is not completely unexpected for some to reject the null hypothesis purely by chance (obligatory XKCD reference).
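
For reference, each data set was tested with a call of roughly this shape (the synthetic arrays below merely stand in for one histogram's measured cycle counts):

```python
import numpy as np
from scipy.stats import anderson_ksamp

rng = np.random.default_rng(0)
# Synthetic stand-ins for one histogram's two classes of cycle counts;
# in practice these are the measured values for "zeros" and "random" inputs.
zeros_cycles = rng.integers(180, 220, size=500_000)
random_cycles = rng.integers(180, 220, size=500_000)

res = anderson_ksamp([zeros_cycles, random_cycles], midrank=True)
print(res.statistic, res.significance_level)
# SciPy clips the returned significance level to a limited range (0.25 at the
# high end), which is why so many of the p-values above come out as exactly 0.25.
```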

I know this sounds suspiciously like $p$-hacking, trying out all possible tests until one returns what I expect, but as a matter of fact even Student's t-test already does -- I just feel I'm violating too many of its assumptions for the result to be valid.

To conclude, a more specific question: is the $k$-sample Anderson-Darling test suitable for my experiment?

Note: a previous version of the question included information on an attempt using the chi-square test. I deleted it since the question was getting too long. Feel free to look at the edit history if necessary.

  • Student's t-test works well enough for discrete distributions as long as the samples are large enough. Your samples seem to be very large. A permutation test is a non-parametric alternative that does not care about ties. – Michael Lew Feb 14 '24 at 20:38
  • Thank you for the comment. Given the bimodal characteristic, is Student's t-test still indicated? As for the permutations test, assuming it's the test implemented here, I suppose the suggestion would be to use e.g. the mean as a statistic, so as to test the hypothesis that the means are identical? – swineone Feb 14 '24 at 21:49
  • Student's t-test is a test of means. If the population(s) have an extremely bimodal distribution then any analysis that does not deal with the two masses separately is likely to leave some information on the table. The mean of a bimodal distribution is not often very interesting by itself. – Michael Lew Feb 15 '24 at 01:12
  • This seems like a research question that should involve collaboration with a statistician. – John Madden Feb 16 '24 at 15:56
  • Student's t will be fine theoretically for comparing means in such large samples (CLT and all that), however you need to decide whether comparing means is what you want. Student will not find differences other than differences in means. Are you fine with that? That is basically your call, based on background knowledge and how the result will be used. Also of course (as you apparently know already) with large samples you always have the chance to reject based on a deviation from the H0 so small that it is practically meaningless. This however is not a problem as long as you don't reject. – Christian Hennig Feb 16 '24 at 17:02
  • If you are interested in detecting any difference between the distributions, your Anderson-Darling (AD) proposal looks fine. Although once more with such data sets finding a significant difference doesn't necessarily imply that it is meaningful. Also AD will look for more than just mean differences and may be less powerful than Student if what you are really interested in is mean differences (or generally "monotonic" differences). – Christian Hennig Feb 16 '24 at 17:06
  • @ChristianHennig thank you for the comments. I don't feel that comparing means is enough; imagine that in histogram 3 the "zeros" data set looked as it does, while the "random" data set had the left mass shifted say 10 cycles to the right, and the right mass shifted say 20 cycles to the left (or whatever is needed to keep the mean constant). Then I would be able to distinguish between "zeros" and "random" inputs by looking at the spread of the bimodal distribution, and this might lead to a timing attack. So I guess I really need to make sure the shape is approximately the same as well. – swineone Feb 17 '24 at 01:52
  • @ChristianHennig so let's assume I go with Anderson-Darling, but as you said, it's going to pick up even the smallest differences. Can I somehow either (1) quantify this difference so I can evaluate whether it's relevant, or (2) calculate how many samples I should use (presumably less than the 1,000,000 samples I'm currently using), so that it is unlikely to pick up on differences smaller than say 1 cycle? I know it's clear from the histograms that a 1 cycle difference appears impossibly unlikely, but I'd still like to have a formal test confirm that for me. – swineone Feb 17 '24 at 07:25
  • @swineone In principle the difference is quantified by the AD test statistic. This of course needs interpretation. How big a difference is "relevant" depends on your specific application for which I'm not an expert. If you want to decide between attack or not you need to think about what size of differences an attack is likely to produce. I can't tell you this. If you understand the relevant effect size, you can just estimate the effect size and see whether it's too big. You don't need to lower your sample size for this (unless you insist that you need "significance" for making a decision). – Christian Hennig Feb 17 '24 at 10:31
  • (In principle you could even still run a significance test on the full sample having specified a minimum relevant size of deviation from the H0, however this is nonstandard and requires more thought than you should expect somebody on a free Q&A site to invest.) – Christian Hennig Feb 17 '24 at 10:33
  • Please be aware of multiple testing! If you run that many different tests, one of them will reject the null hypothesis just by chance. Have a look at alpha-error accumulation and how to counter it with the Bonferroni correction of the significance level. – Ggjj11 Feb 17 '24 at 12:45

1 Answer


Why don't you try the nonparametric permutation test?

  1. Compute Test Statistic: Calculate the test statistic from the observed data.

  2. Permutation: Pool the observed data and randomly shuffle it (equivalently, randomly reassign the group labels); repeat this many times.

  3. Calculate Permutation Test Statistic: For each permutation, compute the test statistic using the shuffled data.

  4. Compare Statistic Distributions: Compare the distribution of permuted test statistics with the observed test statistic.

  5. Calculate P-Value: Determine the probability of observing a test statistic as extreme as the one obtained, based on the permutation distribution.

  6. Draw Conclusion: If the p-value is below the preselected significance level, reject the null hypothesis; otherwise, fail to reject it.
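
A minimal from-scratch sketch of these steps, using the difference in means as the test statistic (any other statistic can be dropped in):

```python
import numpy as np

def permutation_test_mean_diff(x, y, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means between two
    independent samples, following steps 1-6 above."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = x.mean() - y.mean()              # step 1: observed statistic
    pooled = np.concatenate([x, y])
    n_x = len(x)
    count = 0
    for _ in range(n_resamples):                # steps 2-3: shuffle and recompute
        rng.shuffle(pooled)
        perm = pooled[:n_x].mean() - pooled[n_x:].mean()
        if abs(perm) >= abs(observed):          # steps 4-5: compare to observed
            count += 1
    # step 6: p-value (with the usual +1 correction); compare to your alpha
    return (count + 1) / (n_resamples + 1)
```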

For a more detailed description, see https://en.m.wikipedia.org/wiki/Permutation_test. In R, the coin package does the job (see the vignettes at https://cran.r-project.org/web/packages/coin/index.html), as does scipy in Python (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.permutation_test.html).

I don't see the assumptions of a permutation test being violated here.

Example test statistics could be the difference in the means or in any quantile of the run time, or the Mann–Whitney U / Wilcoxon rank-sum test statistic.
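
For instance, a sketch using scipy.stats.permutation_test with the difference in medians as the statistic (the synthetic arrays below only stand in for your measured cycle counts):

```python
import numpy as np
from scipy.stats import permutation_test

rng = np.random.default_rng(1)
# Synthetic stand-ins for the two classes of measured cycle counts.
zeros_cycles = rng.integers(180, 220, size=5_000)
random_cycles = rng.integers(180, 220, size=5_000)

def median_diff(x, y, axis):
    # Difference in medians; a mean, another quantile, or a rank-based
    # statistic can be substituted depending on the attack model.
    return np.median(x, axis=axis) - np.median(y, axis=axis)

res = permutation_test(
    (zeros_cycles, random_cycles),
    median_diff,
    permutation_type="independent",  # unpaired two-sample design
    vectorized=True,
    n_resamples=9_999,
    batch=500,                       # limits memory use per batch of resamples
    alternative="two-sided",
)
print(res.statistic, res.pvalue)
```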
