I'm trying to understand the output of the Kolmogorov-Smirnov test function in R (two-sample, two-sided). Here is a simple test.
x <- c(1,2,2,3,3,3,3,4,5,6)
y <- c(2,3,4,5,5,6,6,6,6,7)
z <- c(12,13,14,15,15,16,16,16,16,17)
ks.test(x,y)
# Two-sample Kolmogorov-Smirnov test
#
#data: x and y
#D = 0.5, p-value = 0.1641
#alternative hypothesis: two-sided
#
#Warning message:
#In ks.test(x, y) : cannot compute exact p-value with ties
ks.test(x,z)
#Two-sample Kolmogorov-Smirnov test
#data: x and z
#D = 1, p-value = 9.08e-05
#alternative hypothesis: two-sided
#
#Warning message:
#In ks.test(x, z) : cannot compute exact p-value with ties
ks.test(x,x)
#Two-sample Kolmogorov-Smirnov test
#data: x and x
#D = 0, p-value = 1
#alternative hypothesis: two-sided
#
#Warning message:
#In ks.test(x, x) : cannot compute exact p-value with ties
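Just to be sure I'm reading the printouts right, I also pulled the values straight out of the htest object that ks.test returns, and I get the same numbers (along with the same ties warnings):

ks.test(x,y)$p.value
# 0.1641
ks.test(x,z)$p.value
# 9.08e-05
ks.test(x,x)$p.value
# 1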
There are a few things I don't understand here.
From the help, it seems that the p-value refers to the hypothesis var1 = var2. However, here that would mean that the test says (p < 0.05):

a. Cannot say that X = Y;
b. Can say that X = Z;
c. Cannot say that X = X (!)
Besides x appearing to be different from itself (!), it also seems quite strange to me that x = z, as the two distributions have zero overlapping support. How is that possible?
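To make the "zero overlapping support" point concrete (this is just a sanity check of my own, not anything from ks.test), the ranges of the two samples don't even touch:

range(x)
# 1 6
range(z)
# 12 17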
According to the definition of the test, D should be the maximum difference between the two probability distributions. For instance, in the case (x, y) it should be D = max|P(x) - P(y)| = 3 (when P(x), P(y) aren't normalized) or D = 0.3 (if they are normalized); the sketch at the end of the post shows the calculation I have in mind. Why is D different from that?

I have intentionally made an example with many ties, as the data I'm working with have lots of identical values. Why do the ties confuse the test? I thought it calculated a probability distribution that should not be affected by repeated values. Any idea?
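For completeness, here is the naive calculation of D I have in mind (my own sketch, comparing the two histograms value by value; clearly not what ks.test computes internally):

vals <- sort(unique(c(x,y)))
px <- table(factor(x, levels=vals))  # count of each value in x
py <- table(factor(y, levels=vals))  # count of each value in y
max(abs(px - py))                    # the "histogram difference" I describe above
# 3  (raw counts)
max(abs(px/length(x) - py/length(y)))
# 0.3  (normalized)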