
Ward (1963) provides a commonly used criterion for hierarchical clustering. It's based on the following definition (p. 237):

Given a set of ratings for 10 individuals, $\{2, 6, 5, 6, 2, 2, 2, 0, 0, 0\}$, a common practice is to use the mean value to represent all the scores rather than to consider individual scores. The "loss" in information that results from treating the 10 scores as one group with a mean of 2.5 can be indicated by a "value-reflecting" number, the error sum of squares (ESS).

The error sum of squares is given by the functional relation,

$$\text{ESS} = \sum_{i=1}^n x_i^2 - \frac{1}{n}\left( \sum_{i=1}^n x_i \right)^2$$

where $x_i$ is the score of the $i$th individual. The ESS for the example is […] 50.5.
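
As a quick sanity check (not part of Ward's paper; just the formula above in plain Python), plugging the example ratings in does give 50.5:

```python
# Ward's ESS: sum of squared scores minus the squared sum divided by n
ratings = [2, 6, 5, 6, 2, 2, 2, 0, 0, 0]
n = len(ratings)

ess = sum(x**2 for x in ratings) - sum(ratings)**2 / n
print(ess)  # 50.5
```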

If somebody asked me how to quantify the loss of information incurred by representing a vector with its mean, I'd say the SD or variance. Or if you wanted the sum of squares rather than the mean of squares, you'd multiply the variance by the sample size, and get $\sum_{i=1}^n (x_i - \bar{x})^2$. This is the sum of squared distances from the mean. So why would one use Ward's ESS instead of one of these quantities?

Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301), 236–244. doi:10.2307/2282967. Retrieved from https://web.archive.org/web/20050312103440/http://iv.slis.indiana.edu/sw/data/ward.pdf

Kodiologist
  • But it is just the same as your formula for ESS. Many statistical programs compute ESS, and then variance (or covariance), the way shown by Ward, i.e. without centering the data values; it is faster. See footnote 1 here. – ttnphns Nov 14 '17 at 22:07
  • See also the last sentence here: https://stats.stackexchange.com/a/237811/3277. Within-group scatter (or the scatter matrix), that is, SSerror, or W, can be computed without centering the data. – ttnphns Nov 14 '17 at 22:19

2 Answers


\begin{align}
\operatorname{Var}(\vec x) \propto \sum_{i=1}^n (x_i - \bar x)^2 &= \sum_i x_i^2 - 2\bar x \sum_i x_i + n \bar x^2 \\
&= \sum_i x_i^2 - n \bar x^2 = \sum_i x_i^2 - \frac 1n \left(\sum_i x_i\right)^2 = \text{ESS}.
\end{align}

I think $\text{ESS}$ is more sensible when talking about compression, because $\text{ESS} = \|\vec x - \bar x \mathbf 1\|_2^2$: it is the squared distance between $\vec x$ and its mean-compressed version under the usual norm on $\mathbb R^n$.
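
As a quick numeric check of this identity (a sketch using only the Python standard library; `pvariance` is the population variance and `variance` the $n-1$ estimator):

```python
import statistics

x = [2, 6, 5, 6, 2, 2, 2, 0, 0, 0]  # the example ratings from the question
n = len(x)
xbar = statistics.mean(x)

ess = sum(v**2 for v in x) - sum(x)**2 / n  # Ward's ESS
ss = sum((v - xbar)**2 for v in x)          # sum of squared deviations

print(ess, ss)                           # 50.5 50.5 -- the same quantity
print(n * statistics.pvariance(x))       # 50.5 -- n times the population variance
print((n - 1) * statistics.variance(x))  # 50.5 (up to float rounding)
```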

jld
  • +1. This answer would look even more classy without "Does that answer your question or did I miss something?" :-) – amoeba Nov 14 '17 at 21:55
  • @amoeba i didn't mean that to be rude, i was just concerned that i missed some aspect of the question. Also thank you for the better formatting. – jld Nov 14 '17 at 21:57
  • I did not mean to say that it was rude! Merely that an answer without any words looks cool IMHO. But of course additional explanations are welcome nevertheless. – amoeba Nov 14 '17 at 22:00
  • @amoeba ah i see what you mean :) sort of like the infamous W: https://math.stackexchange.com/questions/74347/construct-a-function-which-is-continuous-in-1-5-but-not-differentiable-at-2/74383#74383 – jld Nov 14 '17 at 22:02
  • Hahaha did not see that one. – amoeba Nov 14 '17 at 22:04
  • Thanks. I'd checked with an example that the ESS wasn't just equal to $n$ times the variance, and got different answers, so evidently I just made a mistake. – Kodiologist Nov 14 '17 at 22:53
  • @Kodiologist definitely could have been a typo, or depending on the variance estimator in question it could have been $ESS = (n-1) Var$ – jld Nov 15 '17 at 19:01

Ward's ESS is the same as the SS you mention. If you expand the square in your formula, you get:

$$\sum (x_i - \bar x)^2 = \sum x_i^2 + \sum \bar x^2 - 2 \bar x \sum x_i = \sum x_i^2 - n \bar x^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$$