
In a software application, I have access to the order statistics (without any extra computational cost) of a collection of $N$ samples (where $N$ can vary and is often quite large, usually on the order of $10^4$). I would like to quickly determine a rough and approximate degree of normality for the samples, without incurring a huge computational overhead. Note that I do not need a formal boolean test of normality; quantitative confidence or p-values are not important here. I only need some metric or test statistic that will approximately vary along with the degree of normality of the current collection of samples.

I have looked at some options in the literature and fear they may incur too much processing time (and be overkill for my needs, as I only need something loosely approximate). For example, the Anderson–Darling test statistic involves repeated ln() and erf() calls (please correct me if I'm wrong), which are not very cheap for large $N$. The Shapiro–Wilk test statistic seems somewhat promising, but I am unclear on how much work is involved in calculating the coefficients $\alpha_i$. It looks like it could be complex (or involve a very large lookup table?), and would these coefficients need to be recalculated completely whenever $N$ changes (which happens frequently in my application)?

The Shapiro–Francia test seems powerful and simple, but there is no closed-form analytic expression for the values of $m_{i:n}$ required by the test. Is there a good, computationally cheap approximation for the expected value of the $i^{\text{th}}$ order statistic for large $N$?

Maybe the Kolmogorov–Smirnov test statistic is my best bet, because the only "expensive" function needed is erf(), and there may be ways to cheaply scale/interpolate the normal CDF values as $N$ changes. For example, maybe I can precompute them for a very large $N$ and then quickly downscale/interpolate to any smaller $N$.

Is there any computationally cheap test statistic I can calculate, maybe even just using empirical z-scores or something? Any help is appreciated.

  • Given the discussion on this page, what are you trying to accomplish with this normality testing? – EdM Apr 05 '22 at 20:50
  • I am not doing normality testing. No true/false formal statement ever needs to be made on whether the samples are objectively normal with any certain degree of confidence. Instead I need a metric (or "statistic" or "feature") that will approximately vary along with how well the dataset approximately fits a normal distribution. It could be considered a realtime feature over a constantly changing set of samples. New samples will arrive and old ones will drop off. For example, the value could increase as the set better fits a normal distribution and decrease as this becomes less the case. – Special Sauce Apr 06 '22 at 01:48
  • $10^4$ is not large, computationally speaking, for functions like $\ln$. For example, I just computed 100,000 logs of exponential variates in 0.004 seconds on my home PC. – jbowman Apr 07 '22 at 14:55
  • Yes, the ln() call in itself is not a problem. But the erf() is expensive if called repeatedly. – Special Sauce Apr 07 '22 at 19:11

2 Answers


Here is a sketch of a possible methodology.

  • Set up $2k+1$ bins of equal probability using the inverse cumulative normal distribution.

  • Estimate the sample mean $\hat{\mu}$ and standard deviation $\hat{\sigma}$.

  • Standardize the sample points: $z_i = \frac{x_i-\hat{\mu}}{\hat{\sigma}}$.

  • Count the number of standardized sample points $z_i$ in each bin $j$ to get an empirical probability distribution over the bins: accumulate the bin probability $P_j$ with an increment of $\frac{1}{N}$ for each data point that falls in bin $j$.

  • Use the discrete KL-divergence as your statistic. KL-divergence is asymmetric: use the empirical distribution for $P_j$ and $\frac{1}{2k+1}$ for $Q_j$.

  • $D_\text{KL}(P \parallel Q) = \sum\limits_{j=1}^{2k+1} P_j \log\left(\frac{P_j}{Q_j}\right) ~.$

You can set up the bins once (a small constant overhead); the other operations are proportional to the number of data points, so your $O(N)$ requirement should be met. I would not worry about the computational cost of functions like log() evaluated over only $2k+1$ bins.
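Here is a minimal NumPy/SciPy sketch of the steps above; the function names, the default of $2k+1 = 21$ bins, and the use of scipy.stats.norm.ppf for the inverse normal CDF are just illustrative choices.

```python
import numpy as np
from scipy.stats import norm


def make_bin_edges(n_bins):
    """Interior edges of n_bins equal-probability bins under the standard
    normal. They depend only on n_bins, so compute them once and reuse."""
    return norm.ppf(np.arange(1, n_bins) / n_bins)


def kl_normality(x, interior_edges):
    """Discrete KL divergence D_KL(P || Q) between the binned, standardized
    sample (P) and the uniform reference 1/n_bins per bin (Q).
    Smaller values indicate a closer fit to normality."""
    x = np.asarray(x, dtype=float)
    n_bins = interior_edges.size + 1

    # Standardize with the sample mean and standard deviation.
    z = (x - x.mean()) / x.std()

    # Assign each point to a bin (a binary search over the edges) and count.
    counts = np.bincount(np.searchsorted(interior_edges, z), minlength=n_bins)
    p = counts / x.size
    q = 1.0 / n_bins

    # By the convention 0*log(0) = 0, empty bins contribute nothing.
    nonzero = p > 0
    return float(np.sum(p[nonzero] * np.log(p[nonzero] / q)))


# Near-normal data should give a value close to 0; non-normal data, larger.
edges = make_bin_edges(21)  # 2k+1 bins with k = 10
print(kl_normality(np.random.normal(size=10_000), edges))
print(kl_normality(np.random.uniform(-1, 1, size=10_000), edges))
```

Since the bin edges depend only on the number of bins, a changing $N$ only affects the $\frac{1}{N}$ normalization of the counts.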

krkeane
  • This is an interesting proposition. By fixing the number of bins, the calculation's dependence on the particular current value of $N$ is greatly reduced. Do you have any estimate or heuristic to use for the total number of bins? It seems like it cannot be too high (due to too few samples per bin). But maybe being too low decreases the power of the test? Any ideas on a formula to guide the number of bins, given an estimated average $N$ value? – Special Sauce Apr 07 '22 at 19:05
  • This is a quick and dirty proposal. The OP mentioned a $10^4$ sample size, so 101 bins gives nearly 100 expected points per bin if the data are near normal. Using the convention $0 \ln(0) = 0$, empty bins aren't really a problem. I'd have to code it up to get better intuition. – krkeane Apr 07 '22 at 19:16
  • I would not categorize the proposed method as a test; it's more of a metric useful for ranking a standardized empirical distribution's closeness to the standard normal distribution. Given that the proposed method's only intent is to rank, I don't think the exact number of bins will change the output rankings much. – krkeane Apr 07 '22 at 20:39
  • I misspoke when I said "decreases the power of the test". I meant to say "decreases the discrimination of the metric". I will be running some experimental tests with this metric and should have results in a few days. So far, this answer seems very appealing. – Special Sauce Apr 07 '22 at 22:54
  • Based on your application needs, you can certainly construct a "test" by specifying an ad hoc threshold for $D_{\textrm{KL}}\left(P||Q\right)$. – krkeane Apr 08 '22 at 13:13
  • Thanks for the excellent suggestion. I posted the results of my experiments below. The solution works great and has low computational complexity across a very wide range of $N$. And the optimal number of bins for my discrimination purposes is only ~10, regardless of $N$. – Special Sauce Apr 10 '22 at 07:09

I have found the Kullback–Leibler divergence (a measure of statistical distance) suggested by @krkeane in the accepted answer to be an effective and computationally cheap solution. This implementation does not actually require the order statistics; in fact, it takes less computation on the unordered sample set. Pre-sorting the samples would make the total computation $O(N \log N)$, whereas computing the measure on the unordered samples takes only $O(N \log B)$ in total, where $B$ is the number of bins and the optimal $B \ll N$ (as I show below). The $O(N \log B)$ cost comes from performing a $\log B$ binary search over the bin edges for each of the $N$ unordered points. I've also found that $B$ need not be an odd value, as originally proposed; even values work fine as well.

I have performed a series of experiments to determine the optimal $B$ value for discriminating between random normal samples ($\mathcal{N}$), random uniform samples ($\mathcal{U}$), and a 50%-50% mixture of the two distributions ($\mathcal{N}\mathcal{U}$). I performed these experiments for $N=1k$, $N=10k$, and $N=100k$ sample points. I averaged the results over 500 trials for each tested number of bins ($B$).

Somewhat unexpectedly, I found that the optimal $B$ does not appear to depend on $N$ and always seems to be around 10 or 11 (at least for discriminating $\mathcal{N}$ vs. $\mathcal{U}$ vs. $\mathcal{N}\mathcal{U}$, with $N$ in the range $[1k, 100k]$). I optimized $B$ against the following measure of discrimination: $\operatorname*{argmax}_B \min(\text{ratio}, 1-\text{ratio})$, where, for a given $B$, $$\text{ratio} = \frac{D_\text{KL}(\mathcal{N}\mathcal{U} \parallel Q) - D_\text{KL}(\mathcal{N} \parallel Q)}{D_\text{KL}(\mathcal{U} \parallel Q) - D_\text{KL}(\mathcal{N} \parallel Q)}$$ whenever the expected ranking $D_\text{KL}(\mathcal{U} \parallel Q) \ge D_\text{KL}(\mathcal{N}\mathcal{U} \parallel Q) \ge D_\text{KL}(\mathcal{N} \parallel Q)$ holds, and $\text{ratio} = 0$ is assigned whenever that ranking is violated.

$\min(\text{ratio}, 1-\text{ratio}) = 0.0$ represents the worst possible discrimination between the three sample distributions, and $\min(\text{ratio}, 1-\text{ratio}) = 0.5$ represents the best possible discrimination. I found that higher $N$ lowered the variance of the results and increased the maximum achievable discrimination, but did not affect the optimal $B$ at all. Please see the charts below for a summary of the results. Having optimal $B \approx 10$ means that this measure can be computed very efficiently over an unordered sample set for a huge range of $N$, which was precisely the objective.
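For concreteness, here is a rough sketch of how a single trial of this measurement can be computed; the helper names, the seed, the reduced trial count, and constructing the 50/50 mixture by concatenating two half-sized samples are illustrative choices on my part (my actual runs averaged 500 trials per $B$, as noted above).

```python
import numpy as np
from scipy.stats import norm


def kl_to_uniform_bins(x, n_bins):
    """D_KL(P || Q) of the standardized sample against n_bins
    equal-probability standard-normal bins (Q is uniform at 1/n_bins)."""
    z = (x - x.mean()) / x.std()
    edges = norm.ppf(np.arange(1, n_bins) / n_bins)   # interior bin edges
    p = np.bincount(np.searchsorted(edges, z), minlength=n_bins) / x.size
    p = p[p > 0]                                      # 0*log(0) taken as 0
    return float(np.sum(p * np.log(p * n_bins)))


def discrimination(n_samples, n_bins, rng):
    """min(ratio, 1 - ratio) for one trial of N vs. U[-1, 1] vs. a 50/50 mix."""
    normal = rng.normal(size=n_samples)
    uniform = rng.uniform(-1.0, 1.0, size=n_samples)
    mix = np.concatenate([rng.normal(size=n_samples // 2),
                          rng.uniform(-1.0, 1.0, size=n_samples // 2)])

    d_n = kl_to_uniform_bins(normal, n_bins)
    d_mix = kl_to_uniform_bins(mix, n_bins)
    d_u = kl_to_uniform_bins(uniform, n_bins)

    # The ratio is only meaningful when the expected ranking holds.
    if not (d_u >= d_mix >= d_n):
        return 0.0
    ratio = (d_mix - d_n) / (d_u - d_n)
    return min(ratio, 1.0 - ratio)


rng = np.random.default_rng(0)
for b in (5, 10, 20, 40):
    scores = [discrimination(10_000, b, rng) for _ in range(50)]
    print(b, np.mean(scores))
```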

A typical result with $N=100k$ and $B=10$:

$D_\text{KL}(\mathcal{N} \parallel Q) = 0.000036$
$D_\text{KL}(\mathcal{N}\mathcal{U} \parallel Q) = 0.015816$
$D_\text{KL}(\mathcal{U} \parallel Q) = 0.028814$

[Charts: discrimination $\min(\text{ratio}, 1-\text{ratio})$ vs. $B$ for $N = 1k$, $10k$, and $100k$ samples]


UPDATE

After further testing, I have learned that the optimal $B$ actually depends on exactly how the two differing distributions overlap. For example:

  • $\mathcal{N}(\mu=0,\,\sigma^{2}=1)$ vs. $\mathcal{U}[-1, 1]$: optimal $B=10$ (and the charts given above).
  • $\mathcal{N}(\mu=0,\,\sigma^{2}=1)$ vs. $\mathcal{U}[-2, 2]$: optimal $B=36$ (and a skewed appearance to the charts).

However, in all cases tested so far, the $B$ "sweet spot" has been between 10 and roughly 40, and does not appear to depend on $N$. $B=10$ still performs well across the board, even when it is not the exact optimum for the particular pair of distributions being compared.