$$ y = (2,4,6,8,10) $$

$$ x_1 = (1,2,3,4,5) $$

Linear model:

$$ y = \beta_0 + \beta_1x_1 $$

  • p-value of $x_1$: <2e-16
  • $R^2$: 1.00
  • p-value of model: <2e-16 with 1 variable and 3 residual degrees of freedom
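For reference, the fit can be reproduced by hand. The following is a minimal sketch in plain Python, not the software output quoted above; the variable names are illustrative assumptions. It suggests that the residual sum of squares is exactly zero for these five points, so the standard error of the slope is zero and the usual t statistic has no finite value.

```python
# Ordinary least squares by hand for the five points in the question.
# All names here are illustrative; only x and y come from the question.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

beta_1 = s_xy / s_xx              # slope
beta_0 = y_bar - beta_1 * x_bar   # intercept

# Residual sum of squares and R^2.
rss = sum((yi - (beta_0 + beta_1 * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - rss / s_yy
residual_df = n - 2               # 5 observations minus 2 estimated coefficients

# Standard error of the slope; it is zero when the fit is perfect,
# so the t statistic is reported as infinite instead of dividing by zero.
se_beta_1 = (rss / residual_df / s_xx) ** 0.5
t_stat = float("inf") if se_beta_1 == 0 else beta_1 / se_beta_1

print(beta_0, beta_1, rss, r_squared, residual_df, t_stat)
```

With a zero residual sum of squares, any printed p-value is a limiting value rather than a measured quantity.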

Why doesn't the p-value tell us to reject this model and this variable until we increase the sample size $n$?

jtd
  • What's MLR? In a forum where for some "ML" means maximum likelihood (of course, they say) and for others it means machine learning (of course, they say) explaining your abbreviations does no harm and can defuse puzzlement. – Nick Cox Mar 11 '15 at 12:48
  • A P-value isn't a certification of whether your analysis is sensible (appropriate, well judged) or a quantification of how far it is sensible (etc.). It's just flagging here that a fit that good is unlikely to be a chance fluctuation with this sample size. Wouldn't you be troubled if that were not the case, as it is a perfect fit? – Nick Cox Mar 11 '15 at 12:51
  • As the question has been edited, earlier comments may appear puzzling. The short answer is that the $P$-value is (highly) sensitive to small $n$; it is just not evident in the example you give. – Nick Cox Mar 11 '15 at 13:01
  • @NickCox: Apologies for abbreviation. Given sample size, pooled variance, and valid assumptions about normality, linearity, homoscedasticity, i.i.d., etc., can we say "the likelihood that this relationship is due to random chance--that $x_1$ neither causes $y$ (nor vice versa), nor shares a causal antecedent with $y$, is <2e-16"? (cf. https://stats.stackexchange.com/questions/141253/can-two-variables-be-perfectly-correlated-but-not-share-a-single-causal-chain-an) – jtd Mar 11 '15 at 13:11
  • That wouldn't be correct. No independent observer could say whether this is a chance relationship, a legitimate systematic relationship, or even something someone cooked up. Approach it "from the other direction." "IF there were NO relationship in the larger population, random samples of 5 would show this degree of linear connection in fewer than 2 of 10^16 instances." (Although not every software package would quantify it that way. E.g., SPSS reports no p-value at all.) – rolando2 Mar 11 '15 at 13:24
  • @NickCox - it's not that "a fit that good is unlikely to be a chance fluctuation with this sample size"; it's that "chance fluctuations around a condition of zero fit are unlikely to produce a fit this good with this sample size." – rolando2 Mar 11 '15 at 13:29
  • I agree with @Rolando2. No program can tell you just by looking at data anything about "causal antecedents" or causes. Nor is there a population of relationships, some of which are caused by "random chance", whatever that means, and some of which aren't. By the way, the precise P-value of the order of 1e-16 is suppositious, if only because nothing can be stronger than perfect fit. Unfortunately there is no wording for this that is simultaneously clear, correct and charming, as it is a kind of backwards logic (indeed to many people in statistical science, quite absurd!). – Nick Cox Mar 11 '15 at 13:29
  • @Rolando2 Yes; that is more accurate wording. I am reaching for paraphrases that will make some kind of sense at the level of this question and inadvertently showing that it's dangerous to do so. – Nick Cox Mar 11 '15 at 13:32
  • @rolando2 and NickCox: Thanks! I have tried to put your knowledge into an answer. – jtd Mar 11 '15 at 13:41

1 Answer

Attempting to put the knowledge from @NickCox and @rolando2 into this answer:

The p-value of a multiple regression variable (or model) cannot tell an independent observer anything about causes, but it can say:

  • IF there were NO relationship in the population between $x_1$ and $y$, properly random samples of size $n=5$ would show this degree of fit (or relationship) in a proportion of samples smaller than (a suppositious)* 2e-16.

*Note that a perfect fit between $x_1$ and $y$ in the question makes the p-value suppositious.
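The "IF there were NO relationship" reading can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the question: $y$ is drawn as independent standard normal noise, $x_1$ is held at (1,2,3,4,5), and the function names, thresholds, seed, and number of simulations are arbitrary choices. It estimates how often samples of $n=5$ reach a given $R^2$ when there is no underlying relationship.

```python
import random

def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared correlation)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy ** 2 / (s_xx * s_yy)

def null_fraction(threshold, n_sims=20000, seed=0):
    """Fraction of null samples (y unrelated to x) with R^2 >= threshold."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    x = [1, 2, 3, 4, 5]
    hits = 0
    for _ in range(n_sims):
        # Assumption: pure standard normal noise, independent of x.
        y = [rng.gauss(0, 1) for _ in x]
        if r_squared(x, y) >= threshold:
            hits += 1
    return hits / n_sims

for threshold in (0.5, 0.9, 0.99):
    print(threshold, null_fraction(threshold))
```

Under these assumptions, moderately good fits turn up fairly often with only five points, while fits close to perfect become rare as the threshold approaches 1. That is the sense in which the p-value describes the behaviour of samples under the no-relationship condition and says nothing about causes.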

Please feel free to edit!

jtd