9

I'm reflecting on statistical significance and I think a variable can be irrelevant and significant at the same time. But how do I explain that? And if that were the case, should I include that variable in my regression model?

dipetkov
  • 9,805
  • 14
    A useful and interesting exercise is to generate an explanatory variable independently at random and re-fit a regression. Repeat many times. If, say, your standard for "statistical significance" is the $\alpha=0.05$ level, you will observe that your random variable is "significant" in close to a fraction $\alpha$ of such fits. – whuber Jun 20 '22 at 13:04 (a minimal R sketch of this exercise appears after these comments)
  • 4
    An unintended lesson from this question seems to be that "statistically significant" has a precise definition but "irrelevant" means different things to different people. So poorly defined statements might not be very relevant (= "to the point"). – dipetkov Jun 21 '22 at 06:25
  • It may very well be that such an irrelevant-yet-significant variable is actually reflecting the statistical significance of an 'inverse relationship' in the context. – user361429 Jun 23 '22 at 17:19
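
A minimal R sketch of the exercise whuber describes in the comment above; the outcome y and the real predictor x below are invented purely for illustration:

set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)   # an outcome that has nothing to do with the random regressor added below

# repeatedly add a freshly generated random ("irrelevant") regressor and record its p-value
pvals <- replicate(2000, {
  junk <- rnorm(n)
  summary(lm(y ~ x + junk))$coefficients["junk", "Pr(>|t|)"]
})

# close to a fraction alpha = 0.05 of the fits flag the random variable as "significant"
mean(pvals < 0.05)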

5 Answers

24

Here is a good example. Suppose you are interested in modelling the effect of ice cream sales on the incidence of shark attacks. Now, clearly there is no direct connection; buying ice cream in no way affects the incidence of shark attacks. However, there is a third variable which affects both, namely the temperature outside.

On hot days, people will want ice cream and might also want to go swimming. Hence, hot days see increases in both ice cream sales and shark attacks (by virtue of more people going swimming and hence being at risk for an attack). This is known as confounding.

So, clearly, ice cream sales are irrelevant when studying shark attacks, but were we to regress shark attack numbers on ice cream sales we would find a significant result. However, that statistical significance is confounded by temperature, and so the result is meaningless.
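
For concreteness, here is a minimal R simulation of this confounding story; all numbers and variable names are invented for illustration:

set.seed(1)
n <- 200
temperature     <- rnorm(n, mean = 25, sd = 5)
ice_cream_sales <- 50 + 10 * temperature + rnorm(n, sd = 20)   # driven by temperature
shark_attacks   <- 2 + 0.5 * temperature + rnorm(n, sd = 2)    # also driven by temperature only

# regressing attacks on sales alone: a "significant" but confounded association
summary(lm(shark_attacks ~ ice_cream_sales))

# adjusting for the confounder: the apparent effect of sales largely disappears
summary(lm(shark_attacks ~ ice_cream_sales + temperature))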

EDIT: Because this has generated conversation around the interpretation of "irrelevant", I feel the need to make some additional comments and concede I should have said "it depends" (even though I object to the arguments made). If "relevant" is understood to mean "closely connected or appropriate to what is being done or considered", then "irrelevant" should be taken to mean "not closely connected or appropriate...". In this case, ice cream sales could be considered "relevant" since they would be correlated with shark attacks -- but I find interpreting "relevant" in a purely statistical sense to be an extremely narrow (and irregular) way to attach meaning to "relevant". Nonetheless, it is an interpretation one may have.

If, however, you interpret relevance or "closely connected..." in a mechanistic sense -- i.e. how closely connected are ice cream sales and shark attacks in the sense that I could change the former and affect the latter or vice versa -- then ice cream sales would be considered irrelevant and my original comment applies.

  • 7
    Nice example and +1. What I find so interesting about this is that, if we didn’t know to look at temperature and only cared to predict shark attacks, we might be perfectly content to regress on ice cream sales, if such a model gives acceptable performance for our task. – Dave Jun 20 '22 at 04:22
  • 16
    Re "So, clearly:" if you are truly interested in predicting shark attacks and you found that ice cream sales had predictive value, why dismiss that as "irrelevant"?? Sure, there's no direct connection, but if ice cream sales is at least serving as a proxy for other variables you cannot measure, then wouldn't including them (with this interpretation) have some meaning? – whuber Jun 20 '22 at 13:06
  • 3
    @whuber I didn't say "predicting", I said something which could be interpreted as causal. And while ice cream sales may be related to shark attacks, banning ice cream is not going to remove the threat of shark attacks. – Demetri Pananos Jun 20 '22 at 13:30
  • 1
    @Dave True, however I was very deliberate in my language to not use the word "predict" so as to not be subject to this counter argument. – Demetri Pananos Jun 20 '22 at 13:33
  • 9
    "Interpreted as causal" is an extremely narrow way to attach meaning to "relevant." Moreover, that contradicts what you write at the outset, where you characterize your understanding in terms of an "association" only. – whuber Jun 20 '22 at 13:44
  • 1
    @whuber Then I suggest you write your own answer to supplement where you think mine is deficient. – Demetri Pananos Jun 20 '22 at 13:47
  • 5
    I was hoping for clarification from you, but implicit in your comment is that you accept that your post is problematic. On behalf of future readers, I am content to have pointed out the inconsistencies and potentially confusing statements that currently appear. – whuber Jun 20 '22 at 13:53
  • 2
    @whuber it isn't problematic, it is perfectly consistent even with the inclusion of the word association, your bar is perhaps too high. You're free to add a more nuanced and lengthy comment, but I am perfectly happy with this answer. – Demetri Pananos Jun 20 '22 at 14:02
  • 3
    "the result is meaningless", no, it is meaningful in the sense that there's correlation between the two. Statistical significance is "significance" only in the sense of correlation. And correlation is not causation. If your goal is to judge causation, then it's highly misleading to say that statistical significance of the result is meaningless -- performing the regression in the first place was meaningless, as that's not the way to determine causation. – R.M. Jun 20 '22 at 14:55
  • @R.M. I'm not sure what you mean by "performing the regression in the first place was meaningless, as that's not the way to determine causation". Causal questions can most certainly be answered with regression, but, more to your point, performing the regression on the data in question would be meaningless, which is what I intended to say. The result [of the regression] is meaningless [as an answer to the causal question]. – Demetri Pananos Jun 20 '22 at 16:29
  • 1
    This example is not very clear. It is typical for explaining the 'correlation does not imply causation' situation, but in relation to the 'irrelevance' of a variable it is not clear whether this is the case. In the example "Suppose you are interested in modelling the effect of ice cream sales on incidence of shark attacks", it is important to know the context in which one is interested in modelling this. Without it, one cannot know why the significant correlation would be considered irrelevant. When we suppose that we are interested in this modelling, then indirectly we state that there's relevance. – Sextus Empiricus Jun 21 '22 at 08:19
  • 1
    In short: Why are we interested in modelling if the modelling is irrelevant? This is a contradiction. – Sextus Empiricus Jun 21 '22 at 08:20
  • 2
    "interpreted as causal" it seems rather risky to use the word "causal" when talking about regression. Regression can say "Y can be explained by X" but not "Y is explained by X" because it is a correlation based analysis not a causal one. "Causal questions can be most certainly answered with regression" I strongly disagree, you can define "statistical causality" with regression, but it won't match the normal everyday idea of causality and will lead to misunderstanding. – Dikran Marsupial Jun 22 '22 at 16:13
8

"Relevant" and "irrelevant" are not well-defined statistical terms that everyone agrees on.

Case in point: the answer by @Demetri Pananos interprets "relevant" as "causal" and the answer by @Sextus Empiricus interprets "relevant" as "included in the model that we assume is the true — or at least a valid — model for the data". Both interpretations are meaningful but they are not equivalent.

Consider for example a (correctly specified) model for the response to a medical treatment vs placebo that includes gender and age effects: $E(Y) = \beta_a\textrm{age} + \beta_g\textrm{gender} + \beta_t\textrm{treatment}$. The effect of the treatment, $\beta_t$, is causal: it's the difference between the outcome if a patient is given the treatment vs if the patient is given a placebo. However, the patient doesn't "cause" the outcome of his/her/their own treatment with personal biological and genetic characteristics. A better term for age and gender in this model is "covariates" or "effect modifiers". They are certainly relevant in order to prescribe each patient the most appropriate treatment.
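
A small simulation sketch of a model of this form; the coefficients, sample size, and randomized treatment assignment below are hypothetical:

set.seed(42)
n <- 500
age       <- rnorm(n, mean = 50, sd = 10)
gender    <- rbinom(n, 1, 0.5)
treatment <- rbinom(n, 1, 0.5)   # randomized, as in a trial
y <- 0.3 * age + 2 * gender + 5 * treatment + rnorm(n, sd = 5)

# all three terms will typically come out "significant", but only the treatment
# coefficient carries the causal interpretation above; age and gender are covariates
summary(lm(y ~ age + gender + treatment))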

Another popular attempt to specify what "relevant" means is to make a distinction between "practically significant" and "clinically significant" on one hand and "statistically significant" on the other hand. For example, an effect size might be statistically different from 0 but very small in magnitude (absolute value), so perhaps not particularly useful in understanding phenomenon X, the argument goes.

Practical vs Statistical significance
Is there a colloquial way of saying "small but significant"?
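
A toy illustration of that argument; the sample size and effect size below are made up. With a large enough sample, a negligible mean difference can still come out "statistically significant":

set.seed(1)
x <- rnorm(1e6, mean = 0.005, sd = 1)   # the true mean is nonzero but tiny
t.test(x)   # the p-value is very small, yet the estimated mean of ~0.005 is negligible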

There are subtleties with this interpretation of relevance as well:

Stop talking about “statistical significance and practical significance”

So one takeaway is: If someone uses the term "(ir)relevant", "(in)significant", etc. to discuss their analysis, ask them "What exactly do you mean when you say your result is [a generic term hinting at importance]?"

dipetkov
  • 9,805
  • The Stop talking about “statistical significance and practical significance” article treats practical significance as a binary statistical measure as opposed to a continuous practical / real-world one, and then proceeds to use that to argue that a practically significant result wouldn't be practically significant (although even the example of heart attack deaths may arguably not be all that practically significant, depending on the specifics). Very strange, and the strawmanning gets much worse when he gets to "the other reason". – NotThatGuy Jun 21 '22 at 13:03
  • @NotThatGuy I read your description and conclude -- it's naive to believe that there is one definition of "important" / "relevant" that makes sense for all the projects/studies in the universe. Hence, just be explicit at the start about what "relevant" means in the particular case and there won't be a need for a long discussion and people talking over each other to argue their point. – dipetkov Jun 21 '22 at 13:34
7

I think a variable can be irrelevant and significant at the same time. But, how do I explain that?

This can be explained by using the concept of type I errors.

Below is an example that repeats a t-test 1000 times, each time testing whether a (standard normal) random number generator has a mean different from zero. Say that we consider a p-value below 0.05 significant; then we find 41 times that the mean of the number generator is significantly different from zero.

[figure: p-values in 1000 simulations of a t-test with different seeds; 41 cases fall below the 0.05 significance level]

# simulate 1000 t-tests of H0: mean = 0 on standard normal samples (so H0 is true)
p = rep(0, 1000)
for (i in 1:1000) {
   set.seed(i)
   p[i] = t.test(rnorm(100, 0, 1))$p.value
}

plot(p, bg = 1 + (p < 0.05), col = 1 + (p < 0.05), pch = 21, cex = 0.5,
     main = "p values in 1000 simulations of a t-test with different seed\n 41 cases of significant p<0.05 level",
     xlab = "seed number", ylab = "p-value")
lines(c(0, 1000), c(0.05, 0.05), lty = 2)

Significant observations may happen even when there is no true effect (the null hypothesis is true) because sampling is subject to random statistical variations.

The point of 'significance' is to tell whether some observation is extreme and to express that in terms of a probability, but an extreme/significant observation can still occur when there is no true (relevant) effect. An extreme/significant observation can happen by chance (as in the figure above, where it happened 41 times out of 1000 experiments).

So that is what 'significant' means from a statistical point of view:

an extreme observation that falls outside the range of likely statistical variations that might be expected.

Significance does not directly mean that a variable is 'relevant'.

This is a bit of a language problem as well. Statistical significance relates to the probability of observations. It is not 'significance' in the sense of 'important', as the word is used in common language.

Significance also matters mostly when it is absent. It is more like a minimal condition for some observation/variable to be important, not a sufficient condition.

3

Complementary to Demetri's nice answer (+1):

Aside from (unmeasured) confounding, we might have spurious associations that are not the same as unmeasured confounding. Two variables X and Y might be independent but have similar covariance structures, due to network or spatial dependence, that look indistinguishable from confounding. A reasonably fresh reference on the matter is Network Dependence Can Lead to Spurious Associations and Invalid Inference (2020) by Lee & Ogburn, where the authors refer to this phenomenon as "spurious associations due to dependence".
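
A small sketch of how dependence alone can manufacture a "significant" association between two genuinely independent variables; for simplicity this uses temporal autocorrelation (two independent random walks) rather than the network structure discussed in the paper:

set.seed(3)
n <- 500
x <- cumsum(rnorm(n))   # an independent random walk
y <- cumsum(rnorm(n))   # another, completely unrelated, random walk

# a naive regression that ignores the serial dependence will very often report
# a highly "significant" slope even though x and y are independent
summary(lm(y ~ x))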

usεr11852
  • 44,125
1

The definition of significance is that there is a certain probability (not more than $\alpha=0.05$) that the wrong conclusion was drawn. The result of a statistical test can be significant, i.e. it can state that whatever the respective test tests for holds with high probability.

For example, the test result can state that there is a high probability that the weight of individual apples follows a Gaussian distribution. Or that there is a high probability that the weight of individual apples does not follow a Gaussian distribution. (Note that one of those two cases is wrong, i.e. the test yielded a wrong result. The probability for a wrong but significant result is not more than $\alpha=0.05$.) Or the test result might state that there is a high probability that ice cream consumption and shark attack frequency are correlated. Or that there is a high probability that ice cream consumption does not cause shark attacks. Or that there is a high probability that shark attacks do not cause higher ice cream consumption. Depends on which statistical test is applied to what data.

However, I don't know what you mean by a variable being "significant". The result of a statistical test can be significant. One variable can play different roles in statistical tests.

root
  • 278