This may sound like a rather odd question but I have stumbled upon a distribution of means that resembles a normal, but I am not sure whether it is possible to prove that it is indeed one or not. Here is the setup.
Suppose I take a sample of $n$ independent and identically distributed random vectors/variables $ x_1, x_2,...,x_n$ (I did use a specific distribution but I don't think that matters). We now find the Euclidean distance of each random variable/vector to all the other variables and find an average. Thus, for each variable $j = 1,...,n$ in the sample we have:
$$ Y_j = \frac{1}{n-1} \sum_{i \neq j} ||x_j,x_i|| $$
Now if I increase $n$ and repeatedly plot the distribution of these variables, I get something that increasingly looks like a normal distribution. Is this a fluke or is there something deeper here that I am not understanding? After all, the $Y_j$ variables are not independent of one another neither are the Euclidean distances.
EDIT: To give more details for context. Each random vector $\mathbf{x_i}$ has 8094 components. Each component $m$ is derived independently as follows:
$$ x_{im} = \min \{z_1,...,z_{K_m}\} $$
Here $z_1,...,z_{K_m}$ are draws from a discrete uniform with parameters $[1,108]$. I have a fixed vector which specifies the $K_m$ for each component (i.e. the number of draws for each component).
Here is the relevant R code for the simulation. I note that total is a dataframe of 8094 rows and one column called num. Any vector of integers should do the job.
simulated_vectors <- replicate(10000, {
samples <- lapply(total$num, function(n) sample(1:108, n, replace = TRUE))
min_values <- sapply(samples, min)
return(min_values)
})
mat <- t(simulated_vectors)
distances <- dist(mat)
distances_matrix <- as.matrix(distances)
average_distances <- apply(distances_matrix, 1, function(row) {
(sum(row) - min(row)) / (length(row) - 1)
})
hist(average_distances, main = "Histogram of Average Distances", xlab = "Average Distance")
total$num. Do us all a favor by streamlining it. Here is an adequate version that runs in under one second:df <- data.frame(num = rpois(8094, 3) + 1); n.sim <- 200; n <- 108; X <- sapply(df$num, \(k) ceiling(n * (1 - runif(n.sim)^(1/k)))); d <- colSums(as.matrix(dist(X))) / n.sim; qqnorm(d)(2) See https://stats.stackexchange.com/a/236631/919 and https://stats.stackexchange.com/a/451376/919, inter alia. Change8094to, say,2in the code and rerun it. – whuber Jan 01 '24 at 15:16