6

I have applied the one-sample Kolmogorov–Smirnov test of normality to two variables, and one has a larger p-value, but both are greater than 0.05. For example:

  • $x_1$: p-value = 0.09
  • $x_2$: p-value = 0.06

Does this mean that $x_1$ is better or more normal than $x_2$?

whuber
  • You could compare skewness and kurtosis of the empirical distribution for both variables to see which one is closer to the normal (assuming both are unimodal). QQ-plots would also help in investigating which one is "more normal" (a rough base-R sketch of the moment comparison follows these comments). – Momo Jun 29 '12 at 21:28
  • On a side note, the Kolmogorov-Smirnov test for normality tends to have low power and can't be recommended (if you think that testing for normality is a good idea in the first place). The Anderson-Darling and Shapiro-Wilk tests are much better in terms of power. – MånsT Jul 12 '12 at 05:22
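Picking up that suggestion, here is a minimal base-R sketch of the moment comparison. It uses one of several common conventions for the sample moments, and `x1` and `x2` stand for the asker's two variables; packages such as `moments` offer ready-made versions of these helpers.

skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3              # sample skewness
excess_kurtosis <- function(x) mean((x - mean(x))^4) / sd(x)^4 - 3   # sample excess kurtosis

# For roughly normal data both quantities should be near 0; whichever variable
# is farther from (0, 0) is "less normal" in this particular sense
c(skewness(x1), excess_kurtosis(x1))
c(skewness(x2), excess_kurtosis(x2))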

3 Answers

12

Read this article:

Murdoch, D., Tsai, Y., and Adcock, J. (2008). _P-Values are Random Variables_. The American Statistician, 62, 242–245.

It discusses the fact that if the null hypothesis is true, then the p-value is a uniform random variable: you are just as likely to see a p-value of 0.09 as one of 0.06, 0.01, or 0.99. (When the null is false in a way that the test is designed to detect, the p-value is a random variable whose values are more likely to fall close to 0.)
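You can check that uniformity claim with a quick simulation (a minimal sketch; the sample size and number of replications are arbitrary choices):

# 1000 datasets for which the KS null is exactly true:
# standard normal data tested against the standard normal CDF
p_null <- replicate(1000, ks.test(rnorm(100), pnorm)$p.value)

hist(p_null)          # roughly flat: the p-value is ~Uniform(0, 1) under the null
mean(p_null < 0.05)   # close to 0.05, as a uniform p-value implies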

Here is an added example (in R):

> # p-values from KS tests of t(10) and t(50) samples against the standard normal
> out1 <- replicate(1000, ks.test(rt(100, 10), pnorm)$p.value)
> out2 <- replicate(1000, ks.test(rt(100, 50), pnorm)$p.value)
> 
> # proportion of pairs in which the less normal (10 df) sample got the smaller p-value
> mean(out1 < out2)
[1] 0.522

This simulates data from a t distribution with 10 degrees of freedom and a t distribution with 50 degrees of freedom, runs the KS test on each simulated dataset, and records the p-values. It then looks at each pair and counts how often the p-value for 10 df is smaller than the p-value for 50 df (the 50 df distribution should be "more normal" than the 10 df one). But the simulation only gets this right 52.2% of the time, only slightly better than flipping a coin. I would not want to base any important decisions on something like this.

Now if you are comparing something that is very non-normal to something close to normal, then the p-values will probably show this, but a simple histogram or QQ-plot would also make it obvious.
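For completeness, a minimal sketch of that graphical check (the data here are hypothetical, drawn from the same two t distributions as in the simulation above):

x1 <- rt(100, 10)   # "less normal" sample
x2 <- rt(100, 50)   # "more normal" sample

par(mfrow = c(1, 2))                 # two QQ-plots side by side
qqnorm(x1, main = "x1"); qqline(x1)
qqnorm(x2, main = "x2"); qqline(x2)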

Greg Snow
  • (+1) To the answer. Also, in order to convince yourself, run a simulation! This helped me a lot when studying significance testing, and it has also been helpful for the students I've come to help :-). – Néstor Jun 29 '12 at 20:18
  • I like what you say, but what do you think about "p-value ratios"? I was at a biomedical conference recently, and in a poster a biostatistician used the ratio of p-values as a way to compare effect sizes! It didn't make much sense to me, but what do I know; I have no formal training in statistics. – AlefSin Jun 29 '12 at 20:41
  • In general, @AlefSin, it makes no sense to compare p-values (they are estimates! Would it make sense to compare just the means of two samples? No. You'll usually perform a t-test if you are OK with assuming the samples come from a normal distribution). It is, however, a common mistake. – Néstor Jun 29 '12 at 21:15
  • @AlefSin remember that p-values are just a monotonic transformation of the test statistic. The advantage of p-values is that they are uniformly distributed between 0 and 1 if the null (and other assumptions) are true, and that they are on the 0-to-1 scale. If you are interested in comparing effect sizes, why not just compare the effect sizes (or the normalized test statistic)? I see no advantage of converting to p-values first (see the sketch after these comments). – Greg Snow Jun 30 '12 at 18:15
  • The fact that the KS test can barely discriminate between the t distributions with 10 and 50 degrees of freedom is only a reflection on the close similarity of those distributions and the low power of the test. – Michael Lew Jul 11 '12 at 21:19
  • @MichaelLew yes, if you see that as a problem then just choose 2 distributions, one of which is "more normal" than the other, plug them into the code above (or equivalent) and see how often the "more normal" distribution has a larger p-value. Can you find a meaningful comparison that would not be simpler/better compared using standard plots? – Greg Snow Jul 12 '12 at 17:42
  • @GregSnow The low power of tests for normality is very problematic where users test the normality of small samples and use the failure to find a small P value as a reason to claim that the distribution is close enough to normal. In many cases the conclusion is only a result of the low power of the test. Graphical methods are preferable in my opinion because they force the user to see the paucity of information in the small sample as well as the form of any potential deviation from the hypothesised distribution. (I don't understand your question, but assume that we are actually in agreement.) – Michael Lew Jul 12 '12 at 21:14
  • @MichaelLew I think we are arguing the same side now. I originally interpreted your comment as suggesting that the original poster's strategy might work in cases where there is higher power. – Greg Snow Jul 13 '12 at 15:39
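To illustrate Greg Snow's comment above about comparing effect sizes rather than p-values, here is a hedged sketch (the effect size, the sample sizes, and the use of a one-sample t-test are all arbitrary choices for illustration): two samples with the same underlying effect but different sizes produce wildly different p-values, so a ratio of p-values mostly reflects sample size, not effect size.

small_study <- rnorm(20,  mean = 0.5)   # true effect: 0.5 SD, n = 20
large_study <- rnorm(500, mean = 0.5)   # same true effect: 0.5 SD, n = 500

# One-sample t-tests of "mean = 0": the p-values will typically differ by
# many orders of magnitude even though the underlying effect is identical
t.test(small_study)$p.value
t.test(large_study)$p.value

# The estimated effects themselves are directly comparable
mean(small_study)
mean(large_study)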
4

In general, the lower the p-value, the less belief you attach to your null hypothesis (in fact, the p-value is the chance that, if the null hypothesis were true, a test statistic as extreme as (or more extreme than) the one obtained from your sample would be observed).

As such, it is reasonable to say that the lower the p-value, the more confident you are that there may be an alternative out there that is more likely to produce such an extreme statistic. As we are typically aiming to dis"prove" the null hypothesis (e.g. show that a coefficient in a regression is not zero), we typically say that lower p-values imply better results.

With the K-S test, it's a bit different: here, we typically hope that the null hypothesis is true. Therein lies the problem: at "best" we can say there is overwhelming evidence that the null hypothesis is not true (when the p-value is very low), or that the test we used did not provide evidence against the null hypothesis (e.g. if you find a p-value of 0.5). Unfortunately, there is nothing to say that there isn't an alternative out there (for K-S it could be, e.g., a t-distribution instead of the normal) that would give even better results!

For this reason, it is not a good idea to call the higher p-value "a better result". At most you could say that there is "less evidence against" its null hypothesis.

If there is some sound reason for applying the hard 5% threshold (which in truth is rather arbitrary), the difference doesn't matter anyway, as you indicate: both values are above it.
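To make the point about alternatives concrete, here is a minimal sketch (hypothetical data; the sample size and the choice of a t distribution with 5 degrees of freedom are arbitrary):

x <- rt(50, df = 5)     # data that are actually from a t(5) distribution

ks.test(x, pnorm)       # may well fail to reject normality at this sample size...
ks.test(x, pt, df = 5)  # ...and the correct t(5) alternative is typically not rejected either

A large p-value in the first test is therefore no guarantee that the normal model is the right one, only that this particular test found no evidence against it.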

Nick Sabbe
  • I do not agree with your first statement that "In general, the lower the p-value, the less belief you attach to your null hypothesis". The p-value depends on too many things and, as Greg Snow pointed out in his answer, p-values are inherently random. Given that fact, how could you even compare them just knowing their realizations? That makes no sense. – Néstor Jun 29 '12 at 20:19
  • @Néstor - If we adopt the Fisherian interpretation of an extreme p-value - either something unusual has happened or the model is wrong - a smaller p-value should increase our belief that the model is wrong, albeit informally, and there's no need for the belief to increase only as the p-value crosses, e.g., 0.05. I'd be a lot happier with a p-value of 0.0007 than a p-value of 0.048 if my objective is to confirm the existence of an effect. In the OP's case, though, we could say the empirical DF is closer to a fitted Normal in case 1 than case 2 using the sup-norm, and skip p-values altogether. – jbowman Jul 12 '12 at 00:23
  • @jbowman let's adopt the Fisherian interpretation then: how would you prove that 0.0007 is more significant than 0.048? (as, say, in a significance test). – Néstor Jul 12 '12 at 00:39
  • @Nestor - Arggh! I was just a few characters short of finishing my reply to your earlier comment! I wouldn't prove it; as with Fisher, it's completely informal. The use of things like the 0.05 cutoff comes about by a happy coincidence that the p-value for many tests corresponds, numerically, to a test statistic in Neyman-Pearson hypothesis testing that can be compared directly to a significance level, but conceptually they are different. If we wish to make a formal significance test, I agree, you can't, based on the p-value alone. – jbowman Jul 12 '12 at 00:45
  • @jbowman I'm sorry to have erased it; I realized it had a ridiculous example (I was actually on the phone while writing it, and when I read it after hanging up I laughed at how ridiculous the example I proposed was). And yeah, I've actually read that the Neyman-Pearson and the Fisherian interpretations of significance testing are incompatible... – Néstor Jul 12 '12 at 00:53
  • @Nestor - no problem, I've done that myself. – jbowman Jul 12 '12 at 01:01
1

The smaller p-value represents stronger evidence against the null hypothesis, but it may not be that the first distribution is "better or more normal" than the second. Instead, it may be less easily distinguished from a normal distribution.

Note that the amount of evidence against the null hypothesis in a p-value of 0.06 or 0.09 is quite small. However, if your samples are small then the power of the K-S test to provide evidence is also small.
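A rough way to see that low power in action (a sketch; the t(3) alternative, the sample sizes, and the number of simulations are arbitrary choices):

# How often does the KS test reject normality at the 0.05 level
# when the data really are non-normal?
reject_rate <- function(n, nsim = 1000) {
  mean(replicate(nsim, ks.test(rt(n, df = 3), pnorm)$p.value) < 0.05)
}

reject_rate(20)     # small sample: the real departure from normality is rarely detected
reject_rate(1000)   # large sample: the same departure is detected far more often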

Michael Lew
  • How can you say that? How do you know that the difference between two p-values is significant or not? Do you have a null if the p-values are different? Equal? Can you perform significance testing? I say again: it makes no sense to compare p-values. – Néstor Jul 12 '12 at 00:14
  • @Néstor Which "that" are you asking about? 1. The significance of the difference between the P values is irrelevant (and might indicate that you are conflating significance tests and Neyman-Pearson hypothesis tests). 2. Yes, you do have a null hypothesis, but it may be false. Why does that matter to comparing P values? 3. No, I don't think that you can perform significance testing on P values, but you can make the relevant likelihood functions in many cases. Comparing P values does not need significance testing or hypothesis testing, and the original question does not ask for it. – Michael Lew Jul 12 '12 at 02:38