CLT, Euclidean Distances and Means

Question

This may sound like a rather odd question but I have stumbled upon a distribution of means that resembles a normal, but I am not sure whether it is possible to prove that it is indeed one or not. Here is the setup.

Suppose I take a sample of $n$ independent and identically distributed random vectors/variables $ x_1, x_2,...,x_n$ (I did use a specific distribution but I don't think that matters). We now find the Euclidean distance of each random variable/vector to all the other variables and find an average. Thus, for each variable $j = 1,...,n$ in the sample we have:

$$ Y_j = \frac{1}{n-1} \sum_{i \neq j} ||x_j,x_i|| $$

Now if I increase $n$ and repeatedly plot the distribution of these variables, I get something that increasingly looks like a normal distribution. Is this a fluke or is there something deeper here that I am not understanding? After all, the $Y_j$ variables are not independent of one another neither are the Euclidean distances.

EDIT: To give more details for context. Each random vector $\mathbf{x_i}$ has 8094 components. Each component $m$ is derived independently as follows:

$$ x_{im} = \min \{z_1,...,z_{K_m}\} $$

Here $z_1,...,z_{K_m}$ are draws from a discrete uniform with parameters $[1,108]$. I have a fixed vector which specifies the $K_m$ for each component (i.e. the number of draws for each component).

Here is the relevant R code for the simulation. I note that total is a dataframe of 8094 rows and one column called num. Any vector of integers should do the job.


simulated_vectors <- replicate(10000, {
samples <- lapply(total$num, function(n) sample(1:108, n, replace = TRUE))
  min_values <- sapply(samples, min)
  return(min_values)
})
mat <- t(simulated_vectors)
distances <- dist(mat)
distances_matrix <- as.matrix(distances)
average_distances <- apply(distances_matrix, 1, function(row) {
  (sum(row) - min(row)) / (length(row) - 1)
})
hist(average_distances, main = "Histogram of Average Distances", xlab = "Average Distance")

this is not mean, as the number of pairs is greater than $n$ — Aksakal, Dec 31 '23 at 00:15
That's true but I am finding the average distance a variable is from all others. That is why I would end up with $n$ average distances. — DarkenExcalibur, Dec 31 '23 at 01:43
More details would be helpful. For small vector dimensions it is geometrically evident, and easily demonstrable with simulation, that these mean distances shouldn't generally look at all Normal. — whuber, Dec 31 '23 at 17:34
also what does it mean to "look like normal" to you? Chi distribution looks normal visually for some parameters — Aksakal, Dec 31 '23 at 20:50
Very true, I will provide more details in an edit. If it's Chi-squared that's great too, but unlikely. — DarkenExcalibur, Dec 31 '23 at 22:19
I have also added some R code to repeat what I am doing. Hopefully that should make the explanation make more sense too. — DarkenExcalibur, Dec 31 '23 at 22:54
(1) Please warn your readers that your code will take a long time to run-- and might take forever depending on values in total$num. Do us all a favor by streamlining it. Here is an adequate version that runs in under one second: df <- data.frame(num = rpois(8094, 3) + 1); n.sim <- 200; n <- 108; X <- sapply(df$num, \(k) ceiling(n * (1 - runif(n.sim)^(1/k)))); d <- colSums(as.matrix(dist(X))) / n.sim; qqnorm(d) (2) See https://stats.stackexchange.com/a/236631/919 and https://stats.stackexchange.com/a/451376/919, inter alia. Change 8094 to, say, 2 in the code and rerun it. — whuber, Jan 01 '24 at 15:16

CLT, Euclidean Distances and Means

0 Answers0