Is it possible to find the combined standard deviation?

Question

Suppose I have 2 sets:

Set A: number of items $n= 10$, $\mu = 2.4$ , $\sigma = 0.8$

Set B: number of items $n= 5$, $\mu = 2$, $\sigma = 1.2$

I can find the combined mean ($\mu$) easily, but how am I supposed to find the combined standard deviation?

https://en.wikipedia.org/wiki/Pooled_variance#Pooled_standard_deviation — Chris, Jan 11 '18 at 21:08
wait wait.... Can't you just sum the variances? Look at this!: https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables — Tony, Jan 08 '20 at 02:59

score 39 · Accepted Answer · edited Apr 27 '16 at 11:13


If you simply want to merge the two samples into one, start from the two standard deviations:

$s_1 = \sqrt{\frac{1}{n_1}\sum_{i = 1}^{n_1} (x_i - \bar{y}_1)^2}$

$s_2 = \sqrt{\frac{1}{n_2}\sum_{i = 1}^{n_2} (y_i - \bar{y}_2)^2}$

where $\bar{y}_1$ and $\bar{y}_2$ are the sample means and $s_1$ and $s_2$ are the ($n$-denominator) sample standard deviations.

Writing $z_i$ for the $n_1 + n_2$ observations of the combined sample, its standard deviation is:

$s = \sqrt{\frac{1}{n_1 + n_2}\sum_{i = 1}^{n_1 + n_2} (z_i - \bar{y})^2}$

which is not that straightforward since the new mean $\bar{y}$ is different from $\bar{y}_1$ and $\bar{y}_2$:

$\bar{y} = \frac{1}{n_1 + n_2}\sum_{i = 1}^{n_1 + n_2} z_i = \frac{n_1 \bar{y}_1 + n_2 \bar{y}_2}{n_1 + n_2}$

The final formula is:

$s = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2+ n_1(\bar{y}_1-\bar{y})^2 +n_2(\bar{y}_2-\bar{y})^2}{n_1 + n_2 }}$
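
As a worked example, here are the question's numbers plugged into this formula, assuming the stated $\sigma$ values are $n$-denominator standard deviations (the question does not say which convention it uses):

$\bar{y} = \frac{10 \cdot 2.4 + 5 \cdot 2}{15} = \frac{34}{15} \approx 2.267$

$s = \sqrt{\frac{10 \cdot 0.8^2 + 5 \cdot 1.2^2 + 10(2.4 - 2.267)^2 + 5(2 - 2.267)^2}{15}} \approx \sqrt{\frac{6.4 + 7.2 + 0.178 + 0.356}{15}} \approx \sqrt{0.942} \approx 0.971$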

For the commonly-used Bessel-corrected ("$n-1$-denominator") version of standard deviation, the results for the means are as before, but

$s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + n_1(\bar{y}_1-\bar{y})^2 +n_2(\bar{y}_2-\bar{y})^2}{n_1+n_2 - 1} }$
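
The Bessel-corrected formula can be sketched in Python as follows. This is only an illustration: the function name `combine_two` and the raw data in `a` and `b` are made up for the check, which compares the result against `statistics.stdev` computed on the pooled observations.

```python
import math
import statistics


def combine_two(n1, mean1, sd1, n2, mean2, sd2):
    """Return (n, mean, sd) for the union of two disjoint samples.

    sd1 and sd2 are assumed to be Bessel-corrected (n - 1 denominator).
    """
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    # Within-group sums of squares, recovered from the sample variances.
    within = (n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2
    # Between-group term: squared distance of each group mean from the new mean.
    between = n1 * (mean1 - mean) ** 2 + n2 * (mean2 - mean) ** 2
    return n, mean, math.sqrt((within + between) / (n - 1))


# Hypothetical raw data, used only to check the formula.
a = [2.0, 3.5, 1.0, 4.0, 2.5]
b = [5.0, 6.5, 4.0]
n, mean, sd = combine_two(
    len(a), statistics.mean(a), statistics.stdev(a),
    len(b), statistics.mean(b), statistics.stdev(b),
)
print(math.isclose(sd, statistics.stdev(a + b)))
```

With the question's figures, `combine_two(10, 2.4, 0.8, 5, 2.0, 1.2)` gives a combined standard deviation of roughly 0.93, provided those $\sigma$ values are themselves Bessel-corrected.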

You can read more info here: http://en.wikipedia.org/wiki/Standard_deviation

answered Apr 13 '13 at 09:07 by sashkello · edited Apr 27 '16 at 11:13 by Glen_b


If the OP is using the Bessel-corrected ($n-1$-denominator for the variance) version of sample standard deviation (as almost everyone who asks here will be doing), this answer won't quite give them what they seek. – Glen_b Oct 11 '14 at 05:44
In that case, this section does the trick. (edit to link to old wikipedia version since it's removed from the new one) – Glen_b Oct 11 '14 at 05:50
@Glen_b Good catch. Can you edit this into the answer to make it more useful then? – sashkello Oct 11 '14 at 06:37
I went to Wikipedia to find the proof, but unfortunately this formula is no longer there. Care to elaborate (the proof) or improve Wikipedia? :) – Rauni Lillemets Aug 10 '17 at 10:49
@RauniLillemets see https://en.wikipedia.org/wiki/Pooled_variance#Pooled_standard_deviation – Chris Jan 11 '18 at 21:09
see https://stackoverflow.com/q/9222056/4634775 – Chris Jan 11 '18 at 21:09
If it is of any use, I have verified that this works using java. – Ares Mar 21 '19 at 10:00
Bessel-corrected version works great. Tested in Python. – Pandian Le Oct 11 '22 at 18:37

score 13 · Answer 2 · edited Sep 06 '16 at 20:14


This obviously extends to $K$ groups:

$$ s = \sqrt{ \frac{\sum_{k=1}^K \left[(n_k-1)s_k^2 + n_k(\bar{y}_k-\bar{y})^2\right]} {\left(\sum_{k=1}^K n_k\right) -1} }$$
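
A minimal Python sketch of the $K$-group formula, assuming each group is summarized as a tuple `(n_k, mean_k, sd_k)` with a Bessel-corrected `sd_k`. The name `combine_groups` and the sample data are hypothetical and serve only to check the result against the pooled raw observations.

```python
import math
import statistics


def combine_groups(groups):
    """Combine K disjoint groups given as (n_k, mean_k, sd_k) tuples.

    Each sd_k is assumed to be Bessel-corrected (n_k - 1 denominator).
    """
    groups = list(groups)
    n = sum(n_k for n_k, _, _ in groups)
    mean = sum(n_k * m_k for n_k, m_k, _ in groups) / n
    # Both terms sit inside the sum over k.
    total_ss = sum(
        (n_k - 1) * s_k ** 2 + n_k * (m_k - mean) ** 2
        for n_k, m_k, s_k in groups
    )
    return n, mean, math.sqrt(total_ss / (n - 1))


# Hypothetical raw data, used only to check the formula.
raw = [[1.0, 2.0, 4.0], [3.0, 3.5], [0.5, 6.0, 2.5, 2.0]]
summaries = [(len(g), statistics.mean(g), statistics.stdev(g)) for g in raw]
n_all, mean_all, sd_all = combine_groups(summaries)
pooled_obs = [x for g in raw for x in g]
print(math.isclose(sd_all, statistics.stdev(pooled_obs)))
```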

answered Sep 06 '16 at 17:38 by Ravi Varadhan · edited Sep 06 '16 at 20:14 by gung - Reinstate Monica


This is a bit brief by our standards. Could you say a bit more about how this is derived and why this is the correct answer? – Sycorax Sep 06 '16 at 18:52

score 2 · Answer 3 · answered Feb 28 '18 at 17:28

I had the same problem: given the standard deviations, means and sizes of several disjoint subsets, compute the standard deviation of the union of those subsets.

I like the answer of sashkello and Glen_b ♦, but I wanted to find a proof of it. I did it in this way, and I leave it here in case it is of help for anybody.

So the aim is to see that indeed: $$s = \left(\frac{n_1 s_1^2 + n_2 s_2^2+ n_1(\bar{y}_1-\bar{y})^2 +n_2(\bar{y}_2-\bar{y})^2}{n_1 + n_2 }\right)^{1/2}$$

Step by step:

$$\begin{aligned}
&\left(\frac{n_1 s_1^2 + n_2 s_2^2+ n_1(\bar{y}_1-\bar{y})^2 +n_2(\bar{y}_2-\bar{y})^2}{n_1 + n_2 }\right)^{1/2} \\
&= \left(\frac{\sum_{i=1}^{n_1}(x_i - \bar{y}_1)^2 + \sum_{i=1}^{n_2}(y_i - \bar{y}_2)^2+ n_1(\bar{y}_1-\bar{y})^2 +n_2(\bar{y}_2-\bar{y})^2}{n_1 + n_2 }\right)^{1/2} \\
&= \left(\frac{\sum_{i=1}^{n_1}\left((x_i - \bar{y}_1)^2 + (\bar{y}_1-\bar{y})^2\right) + \sum_{i=1}^{n_2}\left((y_i - \bar{y}_2)^2 + (\bar{y}_2-\bar{y})^2\right)}{n_1 + n_2}\right)^{1/2} \\
&= \left(\frac{\sum_{i=1}^{n_1}\left(x_i^2 + \bar{y}^2 + 2\bar{y}_1^2 -2x_i\bar{y}_1 -2\bar{y}_1\bar{y} \right)}{n_1 + n_2} + \frac{\sum_{i=1}^{n_2}\left(y_i^2 + \bar{y}^2 + 2\bar{y}_2^2 -2y_i\bar{y}_2 -2\bar{y}_2\bar{y} \right)}{n_1 + n_2}\right)^{1/2} \\
&= \left(\frac{\sum_{i=1}^{n_1}\left(x_i^2 + \bar{y}^2 -2\bar{y}\sum_{j=1}^{n_1}\frac{x_j}{n_1}\right) + 2n_1\bar{y}_1^2 -2\bar{y}_1\sum_{i=1}^{n_1}x_i}{n_1 + n_2} + \frac{\sum_{i=1}^{n_2}\left(y_i^2 + \bar{y}^2 -2\bar{y}\sum_{j=1}^{n_2}\frac{y_j}{n_2}\right) + 2n_2\bar{y}_2^2 -2\bar{y}_2\sum_{i=1}^{n_2}y_i}{n_1 + n_2}\right)^{1/2} \\
&= \left(\frac{\sum_{i=1}^{n_1}\left(x_i^2 + \bar{y}^2 -2\bar{y}\sum_{j=1}^{n_1}\frac{x_j}{n_1}\right) + 2n_1\bar{y}_1^2 -2\bar{y}_1n_1\bar{y}_1}{n_1 + n_2} + \frac{\sum_{i=1}^{n_2}\left(y_i^2 + \bar{y}^2 -2\bar{y}\sum_{j=1}^{n_2}\frac{y_j}{n_2}\right) + 2n_2\bar{y}_2^2 -2\bar{y}_2n_2\bar{y}_2}{n_1 + n_2}\right)^{1/2} \\
&= \left(\frac{\sum_{i=1}^{n_1}\left(x_i^2 + \bar{y}^2 -2\bar{y}\sum_{j=1}^{n_1}\frac{x_j}{n_1}\right)}{n_1 + n_2} + \frac{\sum_{i=1}^{n_2}\left(y_i^2 + \bar{y}^2 -2\bar{y}\sum_{j=1}^{n_2}\frac{y_j}{n_2}\right)}{n_1 + n_2}\right)^{1/2}
\end{aligned}$$

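Now the trick is to realize that we can reorder the sums: the term $-2\bar{y}\sum_{j=1}^{n_1}\frac{x_j}{n_1}$ does not depend on $i$ and appears $n_1$ times, so in total it contributes $-2\bar{y}\sum_{i=1}^{n_1}x_i$. We can therefore rewrite the first numerator as $$\sum_{i=1}^{n_1}\left(x_i^2 + \bar{y}^2 -2\bar{y}x_i\right),$$ with the second numerator handled the same way using $y_i$ and $n_2$,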

and hence, continuing with the equality chain: $$ = \left(\frac{\sum_{i=1}^{n_1}\left(x_i - \bar{y}\right)^2}{n_1 + n_2} + \frac{\sum_{i=1}^{n_2}\left(y_i - \bar{y}\right)^2}{n_1 + n_2}\right)^{1/2} = \left(\frac{\sum_{i=1}^{n_1 + n_2}\left(z_i - \bar{y}\right)^2}{n_1 + n_2}\right)^{1/2} = s \qquad \square$$

That being said, there is probably a simpler way to do this.

The formula can be extended to $k$ subsets as stated before. The proof would be induction on the number of sets. The base case is already proven, and for the induction step you should apply a similar equality chain to the latter.
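
The induction idea can be illustrated numerically: merging the subsets one pair at a time with the two-set formula should reproduce the standard deviation of the full union. The sketch below uses the $n$-denominator formula proven above; the helper names `merge` and `summarize` and the data are hypothetical.

```python
import math
import statistics
from functools import reduce


def merge(left, right):
    """Merge two (n, mean, sd) summaries using the n-denominator formula."""
    n1, m1, s1 = left
    n2, m2, s2 = right
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    var = (n1 * s1 ** 2 + n2 * s2 ** 2
           + n1 * (m1 - m) ** 2 + n2 * (m2 - m) ** 2) / n
    return n, m, math.sqrt(var)


def summarize(xs):
    """Summarize raw data as (n, mean, n-denominator sd)."""
    return len(xs), statistics.mean(xs), statistics.pstdev(xs)


# Hypothetical disjoint subsets, used only to check the pairwise merge.
subsets = [[1.0, 2.0, 4.0], [3.0, 3.5], [0.5, 6.0, 2.5, 2.0]]
n_total, mean_total, sd_total = reduce(merge, map(summarize, subsets))
union = [x for xs in subsets for x in xs]
print(math.isclose(sd_total, statistics.pstdev(union)))
```

Each call to `merge` plays the role of the induction step: the summary of the first $k$ subsets is treated as a single set and combined with subset $k+1$.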

I don't see how the question is clear. Are the two data sets assumed to come from the same distribution? Does the OP have the actual observations available or just the sample estimates of mean and standard deviation? — Michael R. Chernick, Feb 28 '18 at 17:54
Yes they are assumed to come from the same distribution. Observations are not available, just the mean and standard deviation of the subsets. — iipr, Feb 28 '18 at 18:11
Then why are you using a formula that involves the individual observations? — Michael R. Chernick, Feb 28 '18 at 18:21
Maybe my answer is not clear. I am simply posting a mathematical proof of the above formula, which allows one to compute s from the standard deviations, means and sizes of two subsets. In the formula there is no reference to the individual observations. In the proof there is, but it's just a proof, and from my point of view, correct. — iipr, Mar 01 '18 at 10:19
I think there's a mistake in the first formula, it should be n_1-1 and n_2-1 in the numerator and another 1 should be subtracted from the denominator, as per @Ravi answer. — savenkov, Mar 15 '20 at 02:20
