2

There are cases (arguably the vast majority) where the data distribution is unknown. Confidence intervals make sense for the class of "nicely" behaved theoretical pdfs, e.g., the Gaussian. In that case, we know that additional samples help us estimate the characteristics (statistical moments) of the pdf. However, there are other distributions for which additional sampling does not help the estimates converge but, on the contrary, makes them diverge — for example, the Cauchy distribution.

[figure: the Cauchy distribution, whose sample estimates do not converge as the sample size grows]
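The divergence is easy to see numerically. Below is a minimal sketch comparing running means of i.i.d. Gaussian and Cauchy samples: the Gaussian running mean settles down, while the Cauchy running mean keeps jumping no matter how many samples are added, because the Cauchy distribution has no finite mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

gauss = rng.normal(size=n)
cauchy = rng.standard_cauchy(size=n)

# Running mean after 1, 2, ..., n samples.
counts = np.arange(1, n + 1)
gauss_running = np.cumsum(gauss) / counts
cauchy_running = np.cumsum(cauchy) / counts

# The Gaussian running mean converges toward 0; the Cauchy one does not
# converge to anything (it is itself standard-Cauchy distributed at every n).
print("Gaussian running mean, last 3:", gauss_running[-3:])
print("Cauchy   running mean, last 3:", cauchy_running[-3:])
```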

So, with this context, my question is: why do we need to add confidence intervals (or, equivalently, report mean and variance) to model predictions in cases where we don't know the true data distribution?

I would like to focus the question on the context of machine-learning inference evaluation scores. A common example is to report a +/- interval around the score, usually obtained by rerunning the model under different random-seed conditions.

[figure: example from a paper reporting scores as mean +/- deviation across runs with different random seeds]

(Note that this is a distinct question from the k-fold method discussed here: https://datascience.stackexchange.com/questions/108792/why-is-the-k-fold-cross-validation-needed, but it could also be a relevant method there if the data violate some of the Central Limit Theorem's conditions, such as being independently and identically distributed.)
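For concreteness, the reporting practice I am asking about can be sketched as follows. This is a hypothetical example: `train_and_eval` stands in for a full train/evaluate run whose seed controls weight initialization, shuffling, and so on.

```python
import numpy as np

def train_and_eval(seed: int) -> float:
    """Stand-in for a full training + evaluation run; the seed controls
    all sources of randomness. Here it just returns a noisy toy score."""
    rng = np.random.default_rng(seed)
    return 0.85 + rng.normal(scale=0.01)  # pretend AUROC around 0.85

# Rerun the "model" with 10 different seeds and report mean +/- sample std.
scores = np.array([train_and_eval(s) for s in range(10)])
print(f"AUROC: {scores.mean():.3f} +/- {scores.std(ddof=1):.3f}")
```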

  • 3
    Are you asking about confidence intervals or prediction intervals? There is a difference. I don't think many people give PIs for Cauchys, so it seems like you are looking at CIs... but then again, "mean and variance" does sound more like PIs. – Stephan Kolassa Nov 14 '22 at 10:53
  • My question is more about ML model predictions. ie people report scores +/- for different instances of the model configured with the same hyperparameters but different random seeds to initialize weights and random states in various parts. – partizanos Nov 14 '22 at 19:08
  • That is yet a third possible meaning: a range of point predictions for different randomizations. This is neither a confidence interval nor a prediction interval. Is it this you are asking about? – Stephan Kolassa Nov 14 '22 at 19:42
  • Yes and I see it in quite some papers I will put an example in the question to demonstrate the case. – partizanos Nov 14 '22 at 21:36
  • @StephanKolassa I added an example – partizanos Nov 14 '22 at 21:48
  • The Cauchy distribution is not a "typical" non-Gaussian distribution, but is outright evil. Things that lead to disaster with Cauchy can still work reasonably well with all kinds of other non-Gaussian distributions for which moments exist. In reality almost all measurements are bounded in some way so that moments will in fact exist (although there can of course be extreme outliers, which need to be dealt with). – Christian Hennig Nov 15 '22 at 10:55
  • If your data have extreme outliers (as any not very small sample from the Cauchy will have), you shouldn't just compute means in the first place, but rather go for something more robust. – Christian Hennig Nov 15 '22 at 10:57

2 Answers

2

We often assume a Gaussian distribution purely for convenience, and mathematical results like the Central Limit Theorem tell us that the Gaussian is frequently a good approximation to the true distribution of uncertain quantities.

If we did not make simplifying assumptions then we would not be able to communicate any uncertainty in our predictions. Yes, our mean and variance estimates are wrong (or perhaps, imperfect would be a better word), but reporting only a point estimate ignores all uncertainty, which is clearly much worse than making some simplifying assumptions.
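A quick simulation illustrates why the Gaussian approximation is usually benign for bounded quantities. This is a sketch: the Bernoulli(0.9) draws stand in for any bounded, decidedly non-Gaussian per-sample quantity, and the CLT predicts the spread of their per-run means.

```python
import numpy as np

rng = np.random.default_rng(42)

# 5,000 "runs", each averaging 200 draws from a very non-Gaussian
# but bounded distribution (Bernoulli with p = 0.9).
n_runs, n_per_run = 5_000, 200
draws = rng.binomial(1, 0.9, size=(n_runs, n_per_run))
means = draws.mean(axis=1)

# The per-run means cluster around 0.9 with roughly the spread the
# CLT predicts: sqrt(p * (1 - p) / n).
predicted_sd = np.sqrt(0.9 * 0.1 / n_per_run)
print(f"empirical sd of means: {means.std():.4f}")
print(f"CLT-predicted sd:      {predicted_sd:.4f}")
```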

jcken
  • 2,907
  • Thank you for the response. I understand that if we have reason to believe a random variable converges, the CLT is perfectly valid. However, that is not always the case, and there are random variables for which this assumption is strong.
    I would be very interested in ways to assert convergence. A simplifying analogy: filling the ocean with 1 drop or 2 drops doesn't make you any more sure of anything. I appreciate and value the pdf as a theoretical circle (useful although it doesn't exist), but can you use a drop-by-drop approach for problems where you don't know whether they are ocean-sized?
    – partizanos Nov 14 '22 at 14:39
1

Apart from simulated examples, we never know the true distribution of the quantities we are looking at, whether these are point predictions or summary statistics like the AUROC you give as an example. We don't even know whether they satisfy the conditions for the Central Limit Theorem so that we could assume asymptotic normality - and even if we could, we rarely know how good this asymptotic approximation is.

However, this does not matter. Rarely does reporting a mean plus/minus a standard deviation imply a confidence interval. Rather, it is simply a description of the variability of the quantity being reported. Thus, it is an example of descriptive statistics, rather than inferential statistics.

Of course, if someone claims a confidence interval with specific properties, e.g., based on a normal distribution with estimated means and variances, then your point certainly holds: insofar as we don't know the true distribution, we have to trust in the CLT that we are at least approximately right.

This makes more sense in certain applications than in others, and knowing when it does and when it does not is part of the domain knowledge of specific areas of application in statistics. For instance, since the AUROC is bounded, it is certainly not Cauchy, and I would have no qualms about accepting an asymptotically normal distribution of AUROCs, with corresponding confidence intervals, as long as the sample size is large enough.
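The descriptive/inferential distinction above can be made concrete with a small numpy sketch (the AUROC values below are made up for illustration):

```python
import numpy as np

# Hypothetical AUROC scores from 10 reruns with different random seeds.
scores = np.array([0.842, 0.851, 0.847, 0.839, 0.855,
                   0.848, 0.844, 0.852, 0.846, 0.850])

mean, sd = scores.mean(), scores.std(ddof=1)
n = len(scores)

# Descriptive: "mean +/- sd" merely summarizes run-to-run variability.
print(f"descriptive: {mean:.3f} +/- {sd:.3f}")

# Inferential: a normal-approximation 95% CI for the *mean* score,
# narrower than +/- sd by a factor of sqrt(n).
half_width = 1.96 * sd / np.sqrt(n)
print(f"95% CI for the mean: [{mean - half_width:.3f}, {mean + half_width:.3f}]")
```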

Stephan Kolassa
  • 123,354
  • Thank you for the response. Indeed AUROC cannot be Cauchy, and it cannot be Gaussian either (which is also unbounded). The "evil part" of the Cauchy distribution is the fact that increasing the number of samples makes our estimate of the score (AUROC in this case) diverge. This could be a characteristic of a whole family of pdfs - I might be wrong, but I guess there could also exist some bounded, non-converging pdfs. For a complex model where the random seeds enter in many places - leading partially to a replication crisis - how can one know what sample size is sufficient, and whether it is a drops-in-the-ocean situation? – partizanos Nov 15 '22 at 15:08
  • 1
    Yes, AUROC cannot be Cauchy nor Gaussian. But that does not matter. What is relevant is that average AUROC, averaged over many runs, will be asymptotically Gaussian, since the preconditions of the CLT are (IMO) met. For a bounded random variable not to be subject to the CLT would require something extremely strange to go on. – Stephan Kolassa Nov 15 '22 at 15:12
  • I really appreciate the answers and the links. Some thoughts: for f(X) = Y, with f a model, X a set of random seeds in N, and Y the codomain of AUC scores in [0,1], a bounded interval. We have a "causality" X -> Y; the precondition of the CLT is that the samples of the variable under study are i.i.d. In the problem we discuss, do we need X and/or Y to be i.i.d.? If it's only X, then yes, the CLT conditions are met, given the usage of a random generator. But maybe the Y (AUC scores) are not totally independent (model, data, hyperparameters). – partizanos Nov 16 '22 at 14:24