A set of data points lies in $\mathbb{R}^n$. If, for every possible direction $\theta$, I calculate the standard deviation $\sigma_\theta$ of these points along that direction and put a dot at distance $\sigma_\theta$ from the origin along direction $\theta$, will I get an ellipsoid?
Maybe this is just a terminology issue, but what do you mean by "points along that direction"? Do you mean the standard deviation of the projection of the data set onto $\theta$? – shadowtalker Dec 12 '15 at 02:18
If you are referring to the shape of a cloud of points, then the shape is influenced by the respective distributions, variances and correlations. Assuming normality, zero correlation and equal variances you will get a sphere. Change the correlation or one of the variances and you get an ellipse. See for example: http://www.garp.org/media/422474/feb10%2052-53_quant.pdf – spdrnl Dec 12 '15 at 03:07
@spdrnl I believe some care is needed here, because the prescription in this question does not give ellipses. It is difficult to see how a normality assumption could apply to any finite set of data points, which cannot possibly have a perfectly Normal distribution. – whuber Dec 12 '15 at 16:14
1 Answer
The idea's on the right track but the formula does not give an ellipsoid.
In searching for a simple, clear, and convincing reason that would apply in any dimension, I thought the following might work.
Every direction is given by a unit-length direction vector $v$: the values of the data points $x_1, \ldots, x_N$ in that direction are their dot products with $v$, written $x_1\cdot v, \ldots, x_N \cdot v$, each of which is a linear combination of the components of $v$. Because the variance
$$V(v) = \text{Var}(x_1\cdot v, \ldots, x_N \cdot v)$$
is a homogeneous quadratic function of these dot products, it is also a homogeneous quadratic function of $v$, even when $v$ is an arbitrary vector of any length. Consequently, level sets of $V$ must be quadratic hypersurfaces, or quadrics. Provided the points are not confined to a lower-dimensional affine subspace of $\mathbb{R}^n$, $V(v)$ is strictly positive for every nonzero $v$, so each level set $V = C$ with $C > 0$ is bounded, and the only bounded quadrics of this kind are ellipsoids.
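To make the quadratic dependence explicit, here is one way to write it out. The symbols $\bar{x}$ (the mean of the data) and $S$ (the sample covariance matrix) are notation introduced for this sketch, and the $N-1$ denominator is an assumed convention; any positive normalizing constant leads to the same conclusion.

$$V(v) = \frac{1}{N-1}\sum_{i=1}^{N}\big((x_i - \bar{x})\cdot v\big)^2 = v^\top S\, v, \qquad S = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})^\top.$$

Replacing $v$ by $tv$ multiplies the right-hand side by $t^2$, which is the homogeneity the argument relies on.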
Consider, then, an arbitrary direction $v$. Fixing a constant positive value $C$ for the level surface, there will be a unique positive multiple $t$ for which $tv$ lies on the level surface. That is, $$C = V(tv) = t^2 V(v).$$ Writing $\text{SD}(v) = \sqrt{V(v)}$ for the standard deviation of the data in direction $v$, the solution is
$$t = \sqrt{\frac{C}{V(v)}} = \frac{\sqrt{C}}{\sqrt{V(v)}} = \frac{\sqrt{C}}{\text{SD}(v)}.$$
This says that to get an ellipsoid, you should put a dot at a distance proportional to $1/\text{SD}(v) = 1/\sigma_\theta$ from the origin. Equivalently, putting a dot at a distance proportional to $\sigma_\theta$ produces a kind of "inverse ellipsoid." An example in two dimensions will easily demonstrate that the resulting curve is not an ellipse.
Here, for $n=2$ dimensions, is a plot of three points $x_1=(0,0), x_2=(1,0), x_3=(0,1)$ and the two curves in question: the red dashed curve at distances proportional to the standard deviation and the solid blue curve at distances proportional to the reciprocal of the SD. It is obvious which one is the ellipse!
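For these three points both curves can be written out explicitly, which makes the comparison concrete. This sketch assumes the sample covariance with an $N-1$ denominator (the convention of Mathematica's `StandardDeviation`), takes $C = 1$, and writes $S$ for the covariance matrix and $v = (\cos\theta, \sin\theta)$; that notation is introduced here for illustration.

$$S = \frac{1}{6}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}, \qquad V(v) = v^\top S\, v = \frac{\cos^2\theta - \cos\theta\sin\theta + \sin^2\theta}{3} = \frac{2 - \sin 2\theta}{6}.$$

A point $(x, y) = r\,(\cos\theta, \sin\theta)$ at distance $r = 1/\text{SD}(v)$ satisfies $r^2 V(v) = 1$, while a point at distance $r = \text{SD}(v)$ satisfies $r^4 = r^2 V(v)$. In Cartesian coordinates these read

$$x^2 - xy + y^2 = 3 \qquad\text{and}\qquad 3\,(x^2 + y^2)^2 = x^2 - xy + y^2,$$

respectively. The first is a quadratic equation (an ellipse); the second is of fourth degree, so it is not a conic.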
To make it perfectly clear what this figure shows, here is the Mathematica code that produced it.
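(* The three data points. *)
x = {{0, 0}, {0, 1}, {1, 0}};
Show[
  Graphics[{PointSize[0.015], Point[x]}],
  ParametricPlot[{
      (* solid curve: distance 1/SD in direction t; this is the ellipse *)
      1/StandardDeviation[x.{Cos[t], Sin[t]}] {Cos[t], Sin[t]},
      (* dashed curve: distance SD in direction t *)
      StandardDeviation[x.{Cos[t], Sin[t]}] {Cos[t], Sin[t]}},
    {t, 0, 2 \[Pi]},
    PlotStyle -> {Directive[Thick], Directive[Thick, Dashed]}]]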
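For readers without Mathematica, the same conclusion can be cross-checked numerically. The following is a minimal sketch in Python using only the standard library, not a translation of the plotting code. The function names `sd_along` and `ellipse_misfit` are hypothetical, and the check is only a necessary condition tested at four angles: an ellipse centred at the origin must satisfy $1/r^2 = a + b\cos 2\theta + c\sin 2\theta$ for some constants $a, b, c$.

```python
import math
from statistics import stdev

# The three example points from the answer.
POINTS = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]


def sd_along(theta):
    """Sample SD (n - 1 denominator) of the points projected onto angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return stdev([x * c + y * s for x, y in POINTS])


def ellipse_misfit(radius):
    """Check a necessary condition for r = radius(theta) to be an ellipse
    centred at the origin: 1/r^2 = a + b*cos(2*theta) + c*sin(2*theta).

    The coefficients are fitted at 0, 45 and 90 degrees; the return value is
    the absolute error of the fitted form at 135 degrees.
    """
    q = lambda deg: 1.0 / radius(math.radians(deg)) ** 2
    a = (q(0) + q(90)) / 2   # from a + b = q(0) and a - b = q(90)
    c = q(45) - a            # from a + c = q(45)
    return abs((a - c) - q(135))


recip_misfit = ellipse_misfit(lambda t: 1.0 / sd_along(t))  # dots at distance 1/SD
sd_misfit = ellipse_misfit(sd_along)                        # dots at distance SD

print(f"1/SD curve misfit: {recip_misfit:.3g}")
print(f"SD curve misfit:   {sd_misfit:.3g}")
```

The misfit for the $1/\text{SD}$ curve should vanish up to rounding error, while the misfit for the SD curve is far from zero, so the SD curve cannot be an origin-centred ellipse. Because $\text{SD}(-v) = \text{SD}(v)$, the curve is symmetric about the origin, so no off-centre ellipse can match it either.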
