
[Image: density estimates of the two samples, plotted on top of each other]

I have two distributions, shown in the image above. They look similar to me. The first vector has ~18K data points and the second one has ~400 data points.

However, when I performed a K-S test, I got the following result:

Two-sample Kolmogorov-Smirnov test

data:  new_df$V1 and init_df$V1
D = 0.17726, p-value = 2.881e-11
alternative hypothesis: two-sided

The p-value is very small, and I cannot explain why. Could you give me some hints?


1 Answer


Recall how hypothesis testing works: you calculate a test statistic based on your underlying data. Here, the test statistic is the maximum difference between the two empirical cumulative distribution functions. You then use your knowledge about the (possibly asymptotic) distribution of this test statistic under the null hypothesis. The $p$ value is the tail probability of this distribution evaluated at your observed test statistic.
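As a concrete illustration, here is a minimal sketch that computes this statistic directly as the largest gap between the two empirical CDFs and checks it against what ks.test reports. The samples here are simulated stand-ins for your new_df$V1 and init_df$V1:

# K-S statistic as the largest gap between the two empirical CDFs
# (simulated data standing in for new_df$V1 and init_df$V1)
set.seed(1)
x <- rnorm(18000)            # stand-in for new_df$V1
y <- rnorm(400, mean = 0.2)  # stand-in for init_df$V1
pooled <- sort(c(x, y))
D_manual <- max(abs(ecdf(x)(pooled) - ecdf(y)(pooled)))
D_manual
ks.test(x, y)$statistic      # matches D_manual (continuous data, no ties)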

The problem is that if your sample size is large, the null distribution is highly concentrated. Put differently: if $n$ is large, then even small differences become statistically significant. And your density estimates do show differences, which I wouldn't even call "small". See also How to choose significance level for a large data set?
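To get a feel for how concentrated the null distribution becomes, a commonly quoted large-sample approximation to the critical value of the two-sample K-S statistic at level $\alpha$ is $c(\alpha)\sqrt{\frac{n_1+n_2}{n_1 n_2}}$ with $c(\alpha)=\sqrt{-\tfrac{1}{2}\ln(\alpha/2)}$, so $c(0.05)\approx 1.36$. A small sketch (the helper ks_crit is just for illustration, and this is an approximation, not the exact critical value):

# Approximate 5% critical value of the two-sample K-S statistic
ks_crit <- function(n1, n2, alpha = 0.05) {
  sqrt(-log(alpha / 2) / 2) * sqrt((n1 + n2) / (n1 * n2))
}
ks_crit(18000, 400)  # about 0.069: your observed D = 0.177 is far above it
ks_crit(100, 100)    # about 0.19: the same D would not be significant here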

As an example, let's consider two almost identical distributions: $N(0,1)$ and $N(0.2,1)$.

[Plot: densities of $N(0,1)$ and $N(0.2,1)$]

We will draw 18,000 samples from the first and 400 samples from the second distribution and calculate the Kolmogorov-Smirnov statistic and p value. We will repeat this exercise 1,000 times. Here is the distribution of the 1,000 p values:

[Histogram of the 1,000 p values]

929 out of 1,000 p values are less than 0.05.

I invite you to try the same exercise with smaller sample sizes than 18,000 and 400. The number of significant p values will go down.
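For instance, a quick sketch in the same spirit (the smaller sample-size pairs below are arbitrary choices) should show the share of significant p values dropping as the samples shrink; the full code for the original simulation is at the bottom:

# Proportion of p values below 0.05 for a few sample-size pairs
# (pairs other than the original 18,000/400 are arbitrary)
sizes <- list(c(18000, 400), c(1800, 400), c(180, 40))
sapply(sizes, function(nn) {
  pvals <- sapply(1:1000, function(ii) {
    set.seed(ii)
    ks.test(rnorm(nn[1], 0), rnorm(nn[2], 0.2))$p.value
  })
  mean(pvals < 0.05)   # proportion of "significant" results
})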

Bottom line: p values are good for assessing statistical significance, but that is not the same thing as practical or clinical significance.

# Simulation: repeat the 18,000-vs-400 K-S test 1,000 times
nn_1 <- 18000
nn_2 <- 400

mean_1 <- 0
mean_2 <- 0.2

# plot the two true densities
xx <- seq(-3, 3.4, by = .01)
plot(xx, dnorm(xx, mean_1), xlab = "", ylab = "", type = "l")
lines(xx, dnorm(xx, mean_2), col = "red")

# draw both samples and record the K-S p value, n_sims times
n_sims <- 1e3
pp <- rep(NA, n_sims)

for ( ii in 1:n_sims ) {
    set.seed(ii)
    pp[ii] <- ks.test(rnorm(nn_1, mean_1), rnorm(nn_2, mean_2))$p.value
}

hist(pp)        # distribution of the p values
sum(pp < .05)   # number of significant results at the 5% level