9

What's the most suitable statistical test for testing whether the distribution of the (x,y) coordinates of the blue points is significantly different from the distribution of the (x,y) coordinates of the red points? I'd also want to know the directionality of this difference. The colored data points are those data points with labels, with the label for blue being distinct from the label for red. White data points are just unlabeled, so could very well be ignored.

[Figure: scatterplot of red, blue, and unlabeled white (x, y) points]

whuber
  • 322,774
user1447630
  • 1,059
  • I don't know if this means you are satisfied with my answer to the nonlinear association question and therefore are searching for an example that doesn't have an existing method. – Michael R. Chernick Sep 08 '12 at 00:14
  • I don't really understand your question. What do you mean by significantly greater? Are you trying to get at a test for a data set having a stronger correlation than another? What is the reason for the white dots? Are they just there to obscure the pattern? – Michael R. Chernick Sep 08 '12 at 00:17
  • It looks to me like the red dots form a pattern similar to the parabolic shape in your previous post but not with a single location to separate monotonic pieces and the blue look sort of linear with a little scatter from a line. – Michael R. Chernick Sep 08 '12 at 00:19
  • @MichaelChernick this question is independent of my previous post although that doesn't discount the possibility of overlap in answers. – user1447630 Sep 08 '12 at 11:47

4 Answers

6

A good test provides insight as well as a quantification of the apparent difference. A permutation test will do that, because you can plot the permutation distribution and it will show you just how and to what extent there is a difference in your data.

A natural test statistic would be the mean difference between the points in one group relative to those in the other -- but with little change you can apply this approach to any statistic you choose. This test views group membership as arising from the random selection of (say) the red points among the collection of all blue or red points. Each possible sample yields a value of the test statistic (a vector in this case). The permutation distribution is the distribution of all these possible test statistics, each with equal probability.

For small datasets, like that of the question ($N=12$ points with subgroups of $n=5$ and $7$ points), the number of samples is small enough you can generate them all. For larger datasets, where $\binom{N}{n}$ is impracticably large, you can sample randomly. A few thousand samples will more than suffice. Either way, these distributions of vectors can be plotted in Cartesian coordinates, shown below using one circular shape per outcome for the full permutation distribution (792 points). This is the null, or reference, distribution for assessing the location of the mean difference in the dataset, shown with a red point and red vector directed towards it.

[Figure: the permutation distribution plotted as a point cloud of 792 outcomes, with the data's mean-difference statistic marked by a red point and a red arrow from the origin]

When this point cloud looks approximately Normal, the Mahalanobis distance of the data from the origin will approximately have a chi-squared distribution with $2$ degrees of freedom (one for each coordinate). This yields a p-value for the test, shown in the title of the figure. That's a useful calculation because it (a) quantifies how extreme the arrow appears and (b) can prevent our visual impressions from deceiving us. Here, although the data look extreme--most of the red points are displaced down and to the left of most of the blue points--the p-value of $0.156$ indicates that such an extreme-looking displacement occurs frequently among random groupings of these twelve points, advising us not to conclude there is a significant difference in their locations.


This R code gives the details of the calculations and construction of the figure.

#
# The data, eyeballed.
#
X <- data.frame(x = c(1,2,5,6,8,9,11,13,14,15,18,19),
                y = c(0,1.5,1,1.25, 10, 9, 3, 7.5, 8, 4, 10,11),
                group = factor(c(0,0,0,1,0,1,1,1,1,0,1,1), 
                               levels = c(0, 1), labels = c("Red", "Blue")))
#
# This approach, although inefficient for testing mean differences in location,
# readily generalizes: by precomputing all possible
# vector differences among all the points, any statistic based on differences
# observed in a sample can be easily computed.
#
dX <- with(X, outer(x, x, `-`))
dY <- with(X, outer(y, y, `-`))
#
# Given a vector `i` of indexes of the "red" group, compute the test 
# statistic (in this case, a vector of mean differences).
#
stat <- function(i) rowMeans(rbind(c(dX[i, -i]), c(dY[i, -i])))
#
# Conduct the test.
#
N <- nrow(X)
n <- with(X, sum(group == "Red"))
p.max <- 2e3  # Use sampling if the number of permutations exceeds this
# set.seed(17)
if (lchoose(N, n) <= log(p.max)) {
  P <- combn(seq_len(N), n)
  stitle <- "P-value"
} else {
  P <- sapply(seq_len(p.max), function(i) sample.int(N, n))
  stitle <- "Approximate P-value"
}
S <- t(matrix(apply(P, 2, stat), 2)) # The permutation distribution
s <- stat(which(X$group == "Red"))   # The statistic for the data
#
# Compute the Mahalanobis distance and its p-value.
# This works because the center of `S` is at (0,0).
#
delta <- s %*% solve(crossprod(S) / (nrow(S) - 1), s)
p <- pchisq(delta, 2, lower.tail = FALSE)
#
# Plot the reference distribution as a point cloud, then overplot the 
# data statistic.
#
plot(S, asp = 1, col = "#00000020", xlab = "dx", ylab = "dy",
     main = bquote(.(stitle)==.(signif(p, 3))))
abline(h = 0, v = 0, lty = 3)
arrows(0, 0, s[1], s[2], length = 0.15, angle = 18, 
       lwd = 2, col = "Red")
points(s[1], s[2], pch = 24, bg = "Red", cex = 1.25)
whuber
  • 322,774
4

A typical way to test whether two one-dimensional distribution functions are different is with the Kolmogorov-Smirnov test, which is based on the statistic:

$$\sup_{x}\,\left|F_1(x) - F_2(x)\right|$$

The problem is that in higher dimensions there are $2^d-1$ ways to define a distribution function. There are a number of papers on higher-dimensional KS tests. Below is a link to one that discusses some efficient methods for carrying out such a test.

Two-Dimensional KS Test
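For reference, the one-dimensional version of this statistic is available off the shelf; the sketch below (in Python, assuming SciPy is installed, with made-up samples rather than the OP's data) shows the 1-D test, while a genuinely two-dimensional variant would need custom code along the lines of the linked paper.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical one-dimensional samples standing in for the OP's data.
rng = np.random.default_rng(0)
red = rng.normal(loc=0.0, scale=1.0, size=50)
blue = rng.normal(loc=0.5, scale=1.0, size=50)

# ks_2samp computes sup_x |F1(x) - F2(x)| and its p-value.
stat, p = ks_2samp(red, blue)
print(stat, p)
```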

muratoa
  • 811
  • Is there some way of implementing a 2D KS test (any variation) in R? Package? – user1447630 Sep 09 '12 at 17:42
  • One popular implementation is based on Fasano, Franceschini (1987) this site contains the references and some matlab code http://www.subcortex.net/research/code/testing-for-differences-in-multidimensional-distributions. I don't know of any implementations in R. – muratoa Sep 09 '12 at 20:40
2

Given the small sample size in your graph, the Wilcoxon rank-sum test seems appropriate to compare the y values in the red and blue groups.
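As a sketch of this suggestion in Python (using SciPy's `mannwhitneyu`, which is equivalent to the Wilcoxon rank-sum test; the y-values are the coordinates eyeballed in the permutation-test answer, not the actual data):

```python
from scipy.stats import mannwhitneyu  # equivalent to the Wilcoxon rank-sum test

# y-values eyeballed from the figure, split by color (see the R data above).
red_y  = [0, 1.5, 1, 10, 4]
blue_y = [1.25, 9, 3, 7.5, 8, 10, 11]

# Two-sided test of whether one group's y-values tend to exceed the other's.
stat, p = mannwhitneyu(red_y, blue_y, alternative="two-sided")
print(stat, p)
```

Note this compares only the y-coordinates, so it ignores any difference in the joint (x, y) distribution.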

RobertF
  • 6,084
  • 2
    The OP is not comparing the distributions of two one-dimensional variables. He is comparing two curves based on two-dimensional points. That is why I find the term "significantly greater" unclear. – Michael R. Chernick Sep 08 '12 at 05:16
  • 1
    Also, from the context of the previous post I think he is referring to measures of correlation and he wants to know which data set exhibits the highest correlation in some "nonlinear sense". – Michael R. Chernick Sep 08 '12 at 05:29
  • @MichaelChernick that's correct. I want to compare two curves based on two-dimensional points. However, I've rephrased my question to reflect the fact that basically I want to know whether the distribution of the (x,y) coordinates of the blue points is significantly different from the distribution of the (x,y) coordinates of the red points. I'd also want to know the directionality of this difference. – user1447630 Sep 08 '12 at 11:44
  • @user1447630 My comments were directed toward RobertF, who in answering your question interpreted "significantly greater" to mean numerically higher in a one-dimensional sense when you actually meant something different. – Michael R. Chernick Sep 08 '12 at 12:01
0

I just read this, and if you're trying to determine whether these points come from different clusters in 2D space, I'd recommend simply taking a multivariate approach.

Use discriminant analysis (DA) with the (x, y) coordinates as the covariates and color as the category. The area under the resulting ROC curve would be a good indication of whether these points come from different clusters.
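A minimal Python sketch of this idea (assuming scikit-learn is available, and reusing the coordinates eyeballed in the permutation-test answer; the in-sample AUC shown here is optimistic, since no holdout is used):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

# (x, y) coordinates eyeballed from the figure; labels: 0 = red, 1 = blue.
X = np.array([[1, 0], [2, 1.5], [5, 1], [6, 1.25], [8, 10], [9, 9],
              [11, 3], [13, 7.5], [14, 8], [15, 4], [18, 10], [19, 11]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1])

# Fit LDA and score the separation by the area under the ROC curve.
lda = LinearDiscriminantAnalysis().fit(X, y)
scores = lda.decision_function(X)   # discriminant scores per point
auc = roc_auc_score(y, scores)      # AUC near 0.5 suggests little separation
print(auc)
```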

User1865345
  • 8,202