I am trying to determine the area of the ellipse that contains the true mean of $(X_n, Y_n)_{1\leq n \leq N}$ with a probability of 90%, i.e. the 90% confidence ellipse area, or C90 area. The sample consists of $N = 3000$ $(X, Y)$ coordinate pairs. The problem is that the definitions I've come across differ in how they calculate the actual area of this ellipse. After making some assumptions about the sample (its distribution, etc.), the main steps are as follows:

  1. Determine the covariance matrix of the sample. In this case it is a $2\times 2$ matrix.

  2. Determine the eigenvalues of this matrix $\lambda_1, \lambda_2$

  3. The square roots of the eigenvalues are proportional to the semi-axes of the ellipse. Hence the area can be calculated as $C90 = \pi\chi^2\sqrt{\lambda_1\lambda_2}$, where $\chi^2=4.605$.
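For concreteness, the three steps above can be sketched in Python with NumPy. This is only an illustration of the procedure as described, not a statement about which kind of ellipse it yields. The function name `ellipse_area_90`, the simulated data and the covariance `true_cov` are made up for the example, and the value $4.605$ is obtained from the closed form $-2\ln(1-0.90)$ for the 90% quantile of a $\chi^2$ distribution with 2 degrees of freedom.

```python
import numpy as np

def ellipse_area_90(xy):
    """Area from the three steps above; xy has shape (N, 2)."""
    cov = np.cov(xy, rowvar=False)         # step 1: 2x2 sample covariance matrix
    lam1, lam2 = np.linalg.eigvalsh(cov)   # step 2: eigenvalues of a symmetric matrix
    chi2_90 = -2.0 * np.log(1.0 - 0.90)    # 90% quantile of chi-square(2), about 4.605
    return np.pi * chi2_90 * np.sqrt(lam1 * lam2)  # step 3

# Hypothetical data: 3000 bivariate normal points with an assumed covariance.
rng = np.random.default_rng(0)
true_cov = np.array([[2.0, 0.6], [0.6, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], true_cov, size=3000)

area = ellipse_area_90(xy)
print(f"area from the three steps: {area:.3f}")
```

Since $\lambda_1\lambda_2$ equals the determinant of the covariance matrix, the eigen-decomposition is not strictly needed for the area alone; it is kept here to mirror the steps.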

Step 3 is the one I'm confused about. I've looked at a couple of sources and each seems to compute the area differently. Some sources also claim that this corresponds to the area of the prediction ellipse, not the confidence ellipse.

My question is whether this three-step procedure to find the C90 area is correct, or whether it instead gives the area of the prediction ellipse.

Edit : updated my definition of the C90 area.

  • What you are describing is not a confidence ellipse, which is an algorithm that contains the true parameter pair in 95% of cases if we replicate the experiment often enough. A confidence ellipse has no reason to contain a specific percentage of the observations, and will in fact shrink as we collect more and more data, so contain fewer and fewer of the observations - simply because it is about parameters, not observations. This is similar to the relationship between quantiles and a confidence interval in a single dimension. – Stephan Kolassa Dec 12 '22 at 11:58
  • It sounds like there are potentially two separate questions here. (1) Determining (in terms of semimajor and semiminor axes and the center) the smallest ellipse that covers 90% of a given set of two-dimensional data points. (2) Given the semimajor and semiminor axes of an ellipse, determining its area. Part (1) is hard, and I haven't found anything with a quick search. There is something on covering all points, but I don't see a "quantile" version. Part (2) is easy. Can you clarify? – Stephan Kolassa Dec 12 '22 at 12:08
  • I'm updating my definition of C90 area and editing the question. Thanks for the clarification – In the blind Dec 12 '22 at 12:21
  • OK, thank you, that is helpful. The algorithm does not give you a confidence ellipse. You can tell because the confidence ellipse should shrink as the sample size $n$ increases (because the means are better and better estimated), but there is no influence of $n$ in your calculation (other than in the estimation of the covariance matrix, but that is just that: the estimation of parameters that will not tend to zero as $n$ grows). Let's see whether we can find something... – Stephan Kolassa Dec 12 '22 at 12:33
  • It looks like we don't have anything here, which is a bit surprising to me. I would assume this to be treated in standard statistical textbooks. The documentation to R's car::dataEllipse() function, may be helpful, especially its references. – Stephan Kolassa Dec 12 '22 at 12:43
  • @Stephan On the contrary, see the last line of the code at https://stats.stackexchange.com/a/34468/919 as well as the explanation at https://stats.stackexchange.com/a/67429/919. – whuber Dec 12 '22 at 15:07

1 Answer

The principle behind the value $4.605$ is that the sum of the squares of two independent standard normal variables (also known as a $\chi^2(\nu = 2)$ variable) is below $4.605$ 90% of the time. The corresponding circle has radius $\sqrt{4.605}$, so its area is $4.605 \pi$.
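A quick simulation can make the origin of $4.605$ tangible. The sketch below assumes two independent standard normal variables and uses the closed form $-2\ln(0.10)$ for the 90% quantile of $\chi^2(2)$; the sample size and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two independent standard normal coordinates per draw.
z = rng.standard_normal((200_000, 2))
r2 = (z ** 2).sum(axis=1)            # squared distance from the origin

threshold = -2.0 * np.log(0.10)      # 90% quantile of chi-square(2), about 4.605
frac_inside = np.mean(r2 <= threshold)
circle_area = np.pi * threshold      # circle of radius sqrt(threshold)

print(f"threshold      = {threshold:.3f}")
print(f"fraction below = {frac_inside:.3f}")
print(f"circle area    = {circle_area:.3f}")
```

The fraction of draws inside the circle should come out close to 0.90, up to simulation noise.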

The term $\sqrt{\lambda_1\lambda_2}$ accounts for the stretching of that circle into an ellipse.


What your $C90$ value refers to depends on what the values $\lambda_1$ and $\lambda_2$ refer to. It seems like these $\lambda$ are estimates of the population variance and in that case the interval relates to a prediction interval. It can also be that the $\lambda$ relate to the variance of the estimate and in that case the interval relates to a confidence interval.

This is similar to the difference between the variance of the population and the variance of the estimate of the mean of the population (the variances differ by a factor $n$, the standard deviations by a factor $\sqrt{n}$).
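The distinction can be illustrated numerically. The sketch below assumes bivariate normal data and a large sample, so that the covariance of the sample mean is approximated by the sample covariance divided by $n$; each eigenvalue is then divided by $n$, each axis shrinks by $\sqrt{n}$, and the area shrinks by $n$. The function name `ellipse_areas` and the simulated data are hypothetical.

```python
import numpy as np

def ellipse_areas(xy, level=0.90):
    """Return (prediction-type area, confidence-type area) for (n, 2) data."""
    n = xy.shape[0]
    cov = np.cov(xy, rowvar=False)              # estimated population covariance
    q = -2.0 * np.log(1.0 - level)              # chi-square(2) quantile
    pred = np.pi * q * np.sqrt(np.linalg.det(cov))      # lambdas = population variance
    conf = np.pi * q * np.sqrt(np.linalg.det(cov / n))  # lambdas = variance of the mean
    return pred, conf

rng = np.random.default_rng(2)
xy = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.6], [0.6, 1.0]], size=3000)
pred_area, conf_area = ellipse_areas(xy)

# Fraction of observations inside the prediction-type ellipse (Mahalanobis distance).
diff = xy - xy.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(xy, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
frac_inside = np.mean(d2 <= -2.0 * np.log(0.10))

print(f"prediction-type area: {pred_area:.3f}")
print(f"confidence-type area: {conf_area:.5f}")
print(f"fraction of points inside prediction-type ellipse: {frac_inside:.3f}")
```

With $n = 3000$ the second area is 3000 times smaller than the first, and roughly 90% of the observations fall inside the first ellipse, which is why the three-step recipe without any division by $n$ behaves like a prediction ellipse.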

  • A small detail: When we create a confidence interval then we use the t-distribution instead of a z-distribution. For the multivariate case I actually don't know how this is done. But, when $n$ is large then the two methods are approximately the same. – Sextus Empiricus Dec 12 '22 at 16:47