4

I have a data set that has been divided into $n$ data subsets.

I am sampling from each of these subsets and getting a tuple consisting of mean, variance, confidence and number of sampled points used.

How can I combine these results?

The only approach I know is a simple function of the number of points and their averages. That won't take into account either the variance or the confidence of the score.

Jeromy Anglim
  • 44,984
Budhapest
  • 581

1 Answer

5

Let $n_i, m_i, v_i$ be the number of samples, observed mean, and variance in sample $i$. Let $n, m, v$ be the corresponding figures for the combined data, with $n = \sum_i n_i$ (sorry, I redefined $n$ here). The combined mean is:

$$m = \frac{1}{n}\sum_i n_i m_i$$

Now for the variance:

$$v = \frac{1}{n-1}\sum_{i,j} (x_{i,j} - m)^2$$

with $x_{i,j}$ the $j^{th}$ observation of sample $i$ and $j=1,2,\ldots, n_i$.

Play around a little by adding and subtracting $m_i$ inside the square:

$$(x_{i,j} -m)^2 = (x_{i,j} - m_i + m_i - m)^2 = (x_{i,j} -m_i)^2 + (m_i-m)^2 +2(x_{i,j}-m_i)(m_i-m)$$

Terms $(m_i-m)$ can be factored out of the summation over $j$:

$$v = \frac{1}{n-1}\left[\sum_i n_i(m_i-m)^2 + 2\sum_i(m_i-m)\sum_j(x_{i,j}-m_i) + \sum_{i,j} (x_{i,j} - m_i)^2\right]$$

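Since $\sum_j (x_{i,j}-m_i)=0$, the middle term vanishes. And since $v_i$ is the sample variance of sample $i$ (computed with the $n_i-1$ denominator), $\sum_j (x_{i,j}-m_i)^2 = (n_i-1)v_i$. So you're left with: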

$$v=\frac{1}{n-1}\left[\sum_i n_i(m_i-m)^2 + \sum_i(n_i-1)v_i\right]$$
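As a minimal sketch of the two formulas above, the master node could combine the per-subset tuples as follows. The function name `combine_stats` and the `(n_i, m_i, v_i)` tuple layout are assumptions made for illustration, and each $v_i$ is assumed to be a sample variance computed with the $n_i - 1$ denominator.

```python
from typing import Iterable, Tuple


def combine_stats(parts: Iterable[Tuple[int, float, float]]) -> Tuple[int, float, float]:
    """Combine (n_i, m_i, v_i) tuples into (n, m, v) for the pooled data.

    Assumes each v_i is a sample variance (denominator n_i - 1).
    """
    parts = list(parts)
    n = sum(n_i for n_i, _, _ in parts)
    if n < 2:
        raise ValueError("need at least two observations in total")

    # Combined mean: size-weighted average of the subset means.
    m = sum(n_i * m_i for n_i, m_i, _ in parts) / n

    # Between-subset term: sum_i n_i (m_i - m)^2
    between = sum(n_i * (m_i - m) ** 2 for n_i, m_i, _ in parts)
    # Within-subset term: sum_i (n_i - 1) v_i
    within = sum((n_i - 1) * v_i for n_i, _, v_i in parts)

    v = (between + within) / (n - 1)
    return n, m, v
```

Because the output has the same shape as each input tuple, the result of one merge can be fed back in as a single part at the next merge step, which may suit the case where new samples keep arriving.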

Confidence intervals for the mean are obtained from $m$ and $v$; for large $n$, for example, an approximate 95% interval is $m \pm 1.96\sqrt{v/n}$. Is that what you were looking for?
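One way to turn the combined $n$, $m$ and $v$ into an interval is the usual normal approximation, sketched below. This assumes the observations are independent draws from the same distribution and that $n$ is large enough for the approximation to hold; the function name `mean_confidence_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(n: int, m: float, v: float, confidence: float = 0.95):
    """Approximate confidence interval for the mean (normal approximation).

    Assumes i.i.d. observations and a reasonably large n.
    """
    if n < 1 or v < 0:
        raise ValueError("n must be positive and v non-negative")
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Standard error of the mean is sqrt(v / n).
    half_width = z * math.sqrt(v / n)
    return m - half_width, m + half_width
```

For small $n$, a Student $t$ critical value would be the more cautious choice than the normal one used here.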

n1k31t4
  • 551
Chap
  • 171
  • It is close but not exactly. The mean is obtained in your answer by just taking the combined sum and dividing by the combined total. I wanted to take into consideration the variance and/or confidence of the individual means while combining them. I am going to continuously draw more and more samples from the different distributed datasets and want to take into consideration the variance in a dataset so as to give it less weight – Budhapest Jun 15 '12 at 02:52
  • Ok, well I'm not sure exactly what the objective is here but I guess you could assign a weight such as $w_i=1/v_i$ and compute the weighted average $m = \sum_i{w_i m_i}/\sum_i w_i$. – Chap Jun 15 '12 at 03:06
  • That would be heuristic. I was looking for some mathematical bounds on the error – Budhapest Jun 15 '12 at 03:09
  • A bit more detailed context would help: what do the observations represent? What is the confidence of the score? Where does the error come from? – Chap Jun 15 '12 at 03:23
  • Say I have a dataset, which I then divide into 5 datasets. I am going to continuously draw samples from these and get the mean and variance of each sample. Every few hundred milliseconds I want to combine the means that I have, using the variance of each of the subsets. I want the answer with 95% confidence, which would give me a certain confidence interval. So at every merge step I want to be able to say that the mean lies within this interval with this confidence and this error. – Budhapest Jun 15 '12 at 03:28
  • I get confused by the fact that you divide the dataset into subsets. For what purpose, parallelisation? If they come from the same original data generating process, then I don't see any justification for departing from the 1st answer. – Chap Jun 15 '12 at 03:37
  • Yeah, I have to divide the dataset for parallelization. Each slave node will give the (mean, error, etc.) tuple and then the master has to combine the results – Budhapest Jun 15 '12 at 03:48
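For reference, the weighting suggested in the comments, $w_i = 1/v_i$ with $m = \sum_i w_i m_i / \sum_i w_i$, could be sketched as below. The function name `inverse_variance_weighted_mean` is hypothetical, and as noted in the thread this is a heuristic; when all subsets come from the same data generating process, the size-weighted mean from the answer is the one the thread settles on.

```python
from typing import Sequence


def inverse_variance_weighted_mean(means: Sequence[float], variances: Sequence[float]) -> float:
    """Weighted average of subset means with weights w_i = 1 / v_i."""
    if len(means) != len(variances) or not means:
        raise ValueError("means and variances must be non-empty and of equal length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be strictly positive")

    weights = [1.0 / v for v in variances]
    return sum(w * m for w, m in zip(weights, means)) / sum(weights)
```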