
This may sound like a rather odd question. I have stumbled upon a distribution of means that resembles a normal distribution, but I am not sure whether it is possible to prove that it actually is one. Here is the setup.

Suppose I take a sample of $n$ independent and identically distributed random vectors $x_1, x_2, \dots, x_n$ (I did use a specific distribution, but I don't think that matters). For each vector, I compute its Euclidean distance to every other vector in the sample and average those distances. Thus, for each $j = 1, \dots, n$ we have:

$$ Y_j = \frac{1}{n-1} \sum_{i \neq j} \|x_j - x_i\| $$
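To make the definition concrete, here is a minimal R sketch on a toy sample. The matrix `x_toy` and the helper name `mean_dist_to_others` are made up for illustration and are not part of the actual simulation.

```r
# Hypothetical helper: average Euclidean distance from each row to all other rows
mean_dist_to_others <- function(x) {
  d <- as.matrix(dist(x))       # n x n matrix of pairwise distances, zero diagonal
  rowSums(d) / (nrow(x) - 1)    # the zero diagonal contributes nothing to the sum
}

# Toy sample: n = 3 points in the plane (illustrative values only)
x_toy <- rbind(c(0, 0), c(3, 4), c(6, 8))
y_toy <- mean_dist_to_others(x_toy)
print(y_toy)
```

The pairwise distances here are 5, 10 and 5, so the three averages are 7.5, 5 and 7.5.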

Now if I increase $n$ and repeatedly plot the distribution of the $Y_j$, I get something that increasingly looks like a normal distribution. Is this a fluke, or is there something deeper here that I am not understanding? After all, the $Y_j$ are not independent of one another, and neither are the Euclidean distances.

EDIT: To give more details for context: each random vector $\mathbf{x_i}$ has 8094 components. Each component $m$ is generated independently as follows:

$$ x_{im} = \min \{z_1,...,z_{K_m}\} $$

Here $z_1, \dots, z_{K_m}$ are independent draws from a discrete uniform distribution on $\{1, \dots, 108\}$. A fixed vector specifies $K_m$, the number of draws, for each component.
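For reference, the minimum of $K$ such draws satisfies $P(x_{im} > t) = \left(\frac{108 - t}{108}\right)^{K}$ for $t = 0, \dots, 108$, so a component can also be generated in one step by inverting this expression at a uniform random number (the same shortcut appears in the comments). The sketch below compares the two routes; the values of `k` and `n_draws` are arbitrary choices for illustration.

```r
set.seed(1)
n_levels <- 108   # support is 1..108, as described above
k <- 3            # illustrative number of draws for one component (assumption)
n_draws <- 1e5    # illustrative number of simulated components (assumption)

# Direct route: draw k values with replacement and keep the smallest
direct <- replicate(n_draws, min(sample.int(n_levels, k, replace = TRUE)))

# Shortcut: invert P(min > t) = ((n_levels - t) / n_levels)^k at a uniform draw
shortcut <- ceiling(n_levels * (1 - runif(n_draws)^(1 / k)))

# Exact mean of the minimum, using E[min] = sum over t >= 0 of P(min > t)
exact_mean <- sum(((n_levels - 0:(n_levels - 1)) / n_levels)^k)

print(c(direct = mean(direct), shortcut = mean(shortcut), exact = exact_mean))
```

If the two routes agree, both sample means should sit close to the exact value, and the shortcut avoids drawing and discarding `k` values per component.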

Here is the relevant R code for the simulation. Note that total is a data frame with 8094 rows and one column called num; any vector of positive integers should do the job.


# Simulate 10,000 vectors; component m is the minimum of total$num[m] draws from 1..108
simulated_vectors <- replicate(10000, {
  samples <- lapply(total$num, function(n) sample(1:108, n, replace = TRUE))
  sapply(samples, min)
})

# Transpose so that each row is one simulated vector
mat <- t(simulated_vectors)
distances <- dist(mat)

distances_matrix <- as.matrix(distances)

# Average distance from each vector to the others (the zero diagonal drops out of the sum)
average_distances <- rowSums(distances_matrix) / (nrow(distances_matrix) - 1)

hist(average_distances, main = "Histogram of Average Distances", xlab = "Average Distance")
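A histogram alone is a fairly weak check of normality, so a normal Q-Q plot may be more informative. The following is a scaled-down sketch that runs quickly; `num_demo` is a hypothetical stand-in for total$num (Poisson counts plus one, as in the streamlined version suggested in the comments), and the sizes `n_vec` and 500 are assumptions chosen only to keep the run short.

```r
set.seed(42)
n_vec <- 200                    # number of simulated vectors (reduced; assumption)
num_demo <- rpois(500, 3) + 1   # hypothetical stand-in for total$num

# One column per component: minimum of num_demo[m] draws from 1..108
mat_demo <- sapply(num_demo, function(k) {
  replicate(n_vec, min(sample.int(108, k, replace = TRUE)))
})

# Average distance from each vector to the other n_vec - 1 vectors
d_demo <- as.matrix(dist(mat_demo))
avg_demo <- rowSums(d_demo) / (n_vec - 1)

# Points lying close to the reference line are consistent with normality
qqnorm(avg_demo)
qqline(avg_demo)
```

Reducing the number of components (the length of `num_demo`) to a handful is a quick way to see how the shape of the plot depends on the dimension.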

  • This is not a mean, as the number of pairs is greater than $n$ – Aksakal Dec 31 '23 at 00:15
  • That's true but I am finding the average distance a variable is from all others. That is why I would end up with $n$ average distances. – DarkenExcalibur Dec 31 '23 at 01:43
  • More details would be helpful. For small vector dimensions it is geometrically evident, and easily demonstrable with simulation, that these mean distances shouldn't generally look at all Normal. – whuber Dec 31 '23 at 17:34
  • Also, what does it mean to "look like normal" to you? The chi distribution looks normal visually for some parameters – Aksakal Dec 31 '23 at 20:50
  • Very true, I will provide more details in an edit. If it's Chi-squared that's great too, but unlikely. – DarkenExcalibur Dec 31 '23 at 22:19
  • I have also added some R code to repeat what I am doing. Hopefully that should make the explanation make more sense too. – DarkenExcalibur Dec 31 '23 at 22:54
  • (1) Please warn your readers that your code will take a long time to run, and might take forever depending on the values in total$num. Do us all a favor by streamlining it. Here is an adequate version that runs in under one second: df <- data.frame(num = rpois(8094, 3) + 1); n.sim <- 200; n <- 108; X <- sapply(df$num, \(k) ceiling(n * (1 - runif(n.sim)^(1/k)))); d <- colSums(as.matrix(dist(X))) / n.sim; qqnorm(d) (2) See https://stats.stackexchange.com/a/236631/919 and https://stats.stackexchange.com/a/451376/919, inter alia. Change 8094 to, say, 2 in the code and rerun it. – whuber Jan 01 '24 at 15:16
