Suppose I have 2 sets:
Set A: number of items $n= 10$, $\mu = 2.4$ , $\sigma = 0.8$
Set B: number of items $n= 5$, $\mu = 2$, $\sigma = 1.2$
I can find the combined mean ($\mu$) easily, but how am I supposed to find the combined standard deviation?
If you simply want to pool the two samples into one, start from the standard deviation of each:
$s_1 = \sqrt{\frac{1}{n_1}\sum_{i = 1}^{n_1} (x_i - \bar{y}_1)^2}$
$s_2 = \sqrt{\frac{1}{n_2}\sum_{i = 1}^{n_2} (y_i - \bar{y}_2)^2}$
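where $x_i$ and $y_i$ are the observations in the first and second sample, $\bar{y}_1$ and $\bar{y}_2$ are the sample means, and $s_1$ and $s_2$ are the sample standard deviations.
Writing $z_1, \dots, z_{n_1 + n_2}$ for all observations of both samples taken together, the combined standard deviation is:
$s = \sqrt{\frac{1}{n_1 + n_2}\sum_{i = 1}^{n_1 + n_2} (z_i - \bar{y})^2}$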
which is not that straightforward since the new mean $\bar{y}$ is different from $\bar{y}_1$ and $\bar{y}_2$:
$\bar{y} = \frac{1}{n_1 + n_2}\sum_{i = 1}^{n_1 + n_2} z_i = \frac{n_1 \bar{y}_1 + n_2 \bar{y}_2}{n_1 + n_2}$
The final formula is:
$s = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2+ n_1(\bar{y}_1-\bar{y})^2 +n_2(\bar{y}_2-\bar{y})^2}{n_1 + n_2 }}$
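As a quick illustration, here is a minimal Python sketch that applies this formula to the two sets in the question. It assumes the quoted $\sigma$ values were computed with the $n$ denominator; the function name `combined_sd_population` is just a label chosen for this example.

```python
import math

def combined_sd_population(n1, mean1, sd1, n2, mean2, sd2):
    """Combine two groups' n-denominator SDs into the SD of the pooled data."""
    n = n1 + n2
    # Weighted mean of the two group means.
    mean = (n1 * mean1 + n2 * mean2) / n
    # Within-group sums of squares plus the shift of each group mean
    # away from the combined mean.
    ss = (n1 * sd1**2 + n2 * sd2**2
          + n1 * (mean1 - mean)**2 + n2 * (mean2 - mean)**2)
    return mean, math.sqrt(ss / n)

# Set A: n = 10, mean 2.4, sd 0.8; Set B: n = 5, mean 2, sd 1.2
mean, sd = combined_sd_population(10, 2.4, 0.8, 5, 2.0, 1.2)
print(f"combined mean = {mean:.4f}, combined sd = {sd:.4f}")
```

For these inputs the combined mean is $34/15 \approx 2.27$ and the combined standard deviation comes out at roughly $0.97$.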
For the commonly-used Bessel-corrected ("$n-1$-denominator") version of standard deviation, the results for the means are as before, but
$s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + n_1(\bar{y}_1-\bar{y})^2 +n_2(\bar{y}_2-\bar{y})^2}{n_1+n_2 - 1} }$
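The Bessel-corrected version can be sketched the same way. The snippet below also checks the formula against a direct computation on raw data; the lists `a` and `b` are made-up values used only for this check, and `combined_sd_bessel` is a name invented for the example.

```python
import math
import statistics

def combined_sd_bessel(n1, mean1, sd1, n2, mean2, sd2):
    """Combine two groups' (n-1)-denominator SDs into the SD of the pooled data."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    # (n_k - 1) * s_k^2 recovers each group's sum of squared deviations.
    ss = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2
          + n1 * (mean1 - mean)**2 + n2 * (mean2 - mean)**2)
    return math.sqrt(ss / (n - 1))

# Illustrative raw data (not from the question).
a = [1.0, 2.0, 4.0, 7.0]
b = [3.0, 5.0, 8.0]

from_summaries = combined_sd_bessel(
    len(a), statistics.mean(a), statistics.stdev(a),
    len(b), statistics.mean(b), statistics.stdev(b),
)
direct = statistics.stdev(a + b)  # n-1 denominator on the pooled data
print(from_summaries, direct)

# The question's numbers, if its sigmas are read as n-1 standard deviations.
question_sd = combined_sd_bessel(10, 2.4, 0.8, 5, 2.0, 1.2)
print(f"{question_sd:.4f}")
```

The two printed values on the first line should agree up to floating-point rounding, which is the point of the check.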
You can read more info here: http://en.wikipedia.org/wiki/Standard_deviation
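This extends directly to $K$ groups, where $\bar{y} = \sum_{k=1}^K n_k\bar{y}_k \big/ \sum_{k=1}^K n_k$ is the overall mean:
$$ s = \sqrt{ \frac{\sum_{k=1}^K \left[(n_k-1)s_k^2 + n_k(\bar{y}_k-\bar{y})^2\right]} {\left(\sum_{k=1}^K n_k\right) -1} }$$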
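A possible sketch of the $K$-group case in Python, taking one `(n_k, mean_k, sd_k)` tuple per group. The function name and the sample data are assumptions made for illustration, and the group standard deviations are assumed to use the $n-1$ denominator.

```python
import math
import statistics

def combined_sd_groups(groups):
    """groups: iterable of (n_k, mean_k, sd_k), each sd_k using the n-1 denominator.

    Returns (overall mean, overall n-1 standard deviation).
    """
    groups = list(groups)
    n = sum(n_k for n_k, _, _ in groups)
    mean = sum(n_k * m_k for n_k, m_k, _ in groups) / n
    # Per group: within-group sum of squares plus the between-group term.
    ss = sum((n_k - 1) * s_k**2 + n_k * (m_k - mean)**2
             for n_k, m_k, s_k in groups)
    return mean, math.sqrt(ss / (n - 1))

# Made-up data for three disjoint groups.
data = [[1.0, 2.0, 4.0], [3.0, 5.0, 8.0, 9.0], [2.5, 6.5]]
summaries = [(len(g), statistics.mean(g), statistics.stdev(g)) for g in data]

overall_mean, overall_sd = combined_sd_groups(summaries)
pooled = [x for g in data for x in g]
print(overall_sd, statistics.stdev(pooled))
```

Only the summaries are passed to the function; the raw data appears solely to confirm that both routes give the same number.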
I had the same problem: given the standard deviations, means and sizes of several disjoint subsets, compute the standard deviation of their union.
I like the answer by sashkello and Glen_b, but I wanted a proof of it. Here is how I did it, in case it helps anybody.
So the aim is to see that indeed: $$s = \left(\frac{n_1 s_1^2 + n_2 s_2^2+ n_1(\bar{y}_1-\bar{y})^2 +n_2(\bar{y}_2-\bar{y})^2}{n_1 + n_2 }\right)^{1/2}$$
Step by step: $$\begin{aligned}
&\left(\frac{n_1 s_1^2 + n_2 s_2^2+ n_1(\bar{y}_1-\bar{y})^2 +n_2(\bar{y}_2-\bar{y})^2}{n_1 + n_2 }\right)^{1/2} \\
&= \left(\frac{\sum_{i=1}^{n_1}(x_i - \bar{y}_1)^2 + \sum_{i=1}^{n_2}(y_i - \bar{y}_2)^2+ n_1(\bar{y}_1-\bar{y})^2 +n_2(\bar{y}_2-\bar{y})^2}{n_1 + n_2 }\right)^{1/2} \\
&= \left(\frac{\sum_{i=1}^{n_1}\left((x_i - \bar{y}_1)^2 + (\bar{y}_1-\bar{y})^2\right) + \sum_{i=1}^{n_2}\left((y_i - \bar{y}_2)^2 + (\bar{y}_2-\bar{y})^2\right)}{n_1 + n_2}\right)^{1/2} \\
&= \left(\frac{\sum_{i=1}^{n_1}\left(x_i^2 + \bar{y}^2 + 2\bar{y}_1^2 -2x_i\bar{y}_1 -2\bar{y}_1\bar{y} \right)}{n_1 + n_2} + \frac{\sum_{i=1}^{n_2}\left(y_i^2 + \bar{y}^2 + 2\bar{y}_2^2 -2y_i\bar{y}_2 -2\bar{y}_2\bar{y} \right)}{n_1 + n_2}\right)^{1/2} \\
&= \left(\frac{\sum_{i=1}^{n_1}\left(x_i^2 + \bar{y}^2 -2\bar{y}\sum_{j=1}^{n_1}\frac{x_j}{n_1}\right) + 2n_1\bar{y}_1^2 -2\bar{y}_1\sum_{i=1}^{n_1}x_i}{n_1 + n_2} + \frac{\sum_{i=1}^{n_2}\left(y_i^2 + \bar{y}^2 -2\bar{y}\sum_{j=1}^{n_2}\frac{y_j}{n_2}\right) + 2n_2\bar{y}_2^2 -2\bar{y}_2\sum_{i=1}^{n_2}y_i}{n_1 + n_2}\right)^{1/2} \\
&= \left(\frac{\sum_{i=1}^{n_1}\left(x_i^2 + \bar{y}^2 -2\bar{y}\sum_{j=1}^{n_1}\frac{x_j}{n_1}\right) + 2n_1\bar{y}_1^2 -2\bar{y}_1 n_1\bar{y}_1}{n_1 + n_2} + \frac{\sum_{i=1}^{n_2}\left(y_i^2 + \bar{y}^2 -2\bar{y}\sum_{j=1}^{n_2}\frac{y_j}{n_2}\right) + 2n_2\bar{y}_2^2 -2\bar{y}_2 n_2\bar{y}_2}{n_1 + n_2}\right)^{1/2} \\
&= \left(\frac{\sum_{i=1}^{n_1}\left(x_i^2 + \bar{y}^2 -2\bar{y}\sum_{j=1}^{n_1}\frac{x_j}{n_1}\right)}{n_1 + n_2} + \frac{\sum_{i=1}^{n_2}\left(y_i^2 + \bar{y}^2 -2\bar{y}\sum_{j=1}^{n_2}\frac{y_j}{n_2}\right)}{n_1 + n_2}\right)^{1/2}
\end{aligned}$$
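Now the trick is to realize that we can reorder the sums: the term $-2\bar{y}\sum_{j=1}^{n_1}\frac{x_j}{n_1}$ does not depend on $i$ and appears $n_1$ times, so in total it contributes $-2\bar{y}\sum_{j=1}^{n_1}x_j$. We can therefore rewrite the first numerator (and, analogously, the second with $y_i$ and $n_2$) as $$\sum_{i=1}^{n_1}\left(x_i^2 + \bar{y}^2 -2\bar{y}x_i\right),$$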
and hence, continuing with the equality chain: $$ = \left(\frac{\sum_{i=1}^{n_1}\left(x_i - \bar{y}\right)^2}{n_1 + n_2} + \frac{\sum_{i=1}^{n_2}\left(y_i - \bar{y}\right)^2}{n_1 + n_2}\right)^{1/2} = \left(\frac{\sum_{i=1}^{n_1 + n_2}\left(z_i - \bar{y}\right)^2}{n_1 + n_2}\right)^{1/2} = s \qquad \square$$
That being said, there is probably a simpler way to do this.
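One possibly shorter route, sketched in the same notation: add and subtract the group mean inside each square. The cross term vanishes because the deviations of a set from its own mean sum to zero, $\sum_{i=1}^{n_1}(x_i - \bar{y}_1) = 0$:
$$\sum_{i=1}^{n_1}(x_i-\bar{y})^2 = \sum_{i=1}^{n_1}\left((x_i-\bar{y}_1)+(\bar{y}_1-\bar{y})\right)^2 = \sum_{i=1}^{n_1}(x_i-\bar{y}_1)^2 + 2(\bar{y}_1-\bar{y})\sum_{i=1}^{n_1}(x_i-\bar{y}_1) + n_1(\bar{y}_1-\bar{y})^2 = n_1 s_1^2 + n_1(\bar{y}_1-\bar{y})^2$$
The same identity holds for the second set with $y_i$, $\bar{y}_2$, $n_2$ and $s_2$. Adding the two and dividing by $n_1 + n_2$ gives $s^2$.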
The formula can be extended to $k$ subsets as stated before. The proof is by induction on the number of sets: the base case is proven above, and for the induction step you apply a similar equality chain to the union of the first $k-1$ subsets and the $k$-th subset.
s from the standard deviations, means and sizes of two subsets. In the formula there is no reference to the individual observations. In the proof there is, but it's just a proof and, from my point of view, correct.
– iipr
Mar 01 '18 at 10:19