
I have some score values that are output from a program. There are about 10 such values. The data set is a measure of the "quality" of a speech waveform received over a mobile phone channel and a landline channel. The waveform is passed through an algorithm, which returns a score for its quality relative to a "golden" waveform (which gets a score of 100). My task is to make adjustments to the algorithm that bring the scores from the mobile and landline channels closer together. I am afraid that is the most detail I can give out about the task. Please find below some of these scores:

Mobile: 52 66 69 54 88  
Landline: 60 57 72 49 75  
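
A minimal sketch of the kind of computation I run on these scores (Python with NumPy here is just for illustration; the exact tooling isn't important):

```python
import numpy as np

# The scores listed above
mobile = np.array([52, 66, 69, 54, 88])
landline = np.array([60, 57, 72, 49, 75])

for name, scores in [("Mobile", mobile), ("Landline", landline)]:
    # ddof=1 gives the unbiased sample variance (divides by n - 1)
    print(f"{name}: mean = {scores.mean():.1f}, variance = {scores.var(ddof=1):.1f}")
```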

When I compute the mean and variance of this small data-set, I get very high variance (expected from a small data-set). My question is:

  1. Are data-sets with very high variances usually rejected?

  2. If so, is there a relation between the number of elements in the data set, its mean, and its variance such that I can take a look and say, "Ahhh... the variance is too high in this one (according to some relation that I do not know of), I must reject(?) this data"?

P.S: Please let me know in case my question does not make sense. I will try to elaborate.

Sriram
  • Please add to your question the actual (small) data set you're talking about and a description of where it came from, what it is, and what you wanted to do with it. I ask for this because the answers to your questions are very context dependent. – John Jul 26 '11 at 12:52
  • @John: I have added more information. Please take a look at it. – Sriram Jul 26 '11 at 13:04
  • Thanks for the edit. Just one more question... what would you do if you rejected this data? Would your job be over? – John Jul 26 '11 at 13:15
  • @John: No, I think in that case my job would just about begin! ;) I was playing around with the data and thought if I should place any "confidence" in data with such high variance. Hence the question. – Sriram Jul 26 '11 at 13:38
  • Where might your lack of confidence be coming from? Do you believe the device generating the data is defective? – John Jul 26 '11 at 13:52

1 Answer


Let me clear up some misconceptions first. The estimate of your population variance is not high because your sample is small. In fact, just the opposite is often the case: the variance estimate tends to be low, because small samples over-represent the peak of the distribution. The variance of a larger sample is more representative and more accurate. As a corollary, small samples are less accurate and have higher sampling error, usually measured as the standard error.
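
To make the sampling-error point concrete, here is a small simulation sketch (Python with NumPy; the normal population and its parameters are entirely made up for illustration, not taken from your data):

```python
import numpy as np

# Small simulation: how variance estimates behave at different sample sizes.
# The normal population and its parameters below are purely hypothetical.
rng = np.random.default_rng(0)
true_var = 144.0  # population variance (SD = 12), chosen arbitrarily

for n in (5, 50, 500):
    samples = rng.normal(loc=65, scale=12, size=(10_000, n))
    plug_in = samples.var(axis=1)           # divide by n: biased low for small n
    unbiased = samples.var(axis=1, ddof=1)  # divide by n-1: unbiased, but noisy
    print(f"n={n:>3}: plug-in mean {plug_in.mean():6.1f}, "
          f"unbiased mean {unbiased.mean():6.1f}, "
          f"sampling SD of the unbiased estimate {unbiased.std():6.1f} "
          f"(true variance {true_var})")
```

The plug-in estimate is biased low at small n, and even the unbiased estimate scatters widely when n = 5, which is why a single small sample tells you little about the true variance.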

Data are generally not discarded merely because they might be judged to have high variance. The variance is a property of the data that you need to discover, and you try to collect enough data to determine it reasonably well. Looking at your data, these don't seem like very high variances at all. You can actually get very useful information out of data like this, but you will need more of it.

If you throw away these data, collect another small sample, and that sample has lower variance, it doesn't tell you that the underlying distribution has lower variance; it's just sampling variability. Therefore, don't do that. Just keep collecting more data and noting its properties, like time of day. If it's relatively consistently noisy over a period of time, you might be able to average it all together and get good distributions of the two different kinds of signal.

Clearly there is overlap in your distributions, and you're going to have to take some time to get this working correctly. You'll need to collect lots of samples of each of your different signals to see whether your manipulations of the algorithm have any effect. There's enough noise that it would be easy to fool yourself into thinking you had solved the problem if you throw away samples you don't like. There's also enough noise that you might not have much of a problem at all, yet keep believing you're failing, if you throw away samples you don't like.
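
To give a feel for how little five scores per channel can tell you, here is a rough sketch of one possible comparison, a Welch two-sample t-test with SciPy (the choice of test is mine, purely for illustration; nothing above says this is the right model for your data):

```python
from scipy import stats

mobile = [52, 66, 69, 54, 88]
landline = [60, 57, 72, 49, 75]

# Welch's t-test does not assume equal variances in the two groups.
# With only n = 5 per group the test has very little power, so a
# non-significant result here says almost nothing either way.
t, p = stats.ttest_ind(mobile, landline, equal_var=False)
print(f"t = {t:.2f}, p = {p:.2f}")
```

A large p-value from samples this small is not evidence that the channels already agree; it mostly reflects how little data there is.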

In short, keep all of the data and get more of it. Work out the distributions. Make adjustments to your algorithm. Collect more data. Repeat until you've solved the problem.

When you do get more data and have tried a couple of algorithms, come back and ask for help on exactly how to model your data so that you can decide which algorithms to keep and which to reject. At that point you might post more summary statistics with your question, like means, variances, and perhaps a histogram.
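
A minimal sketch of the kind of summary plot that would help, using Matplotlib purely as an example (any plotting tool will do):

```python
import matplotlib.pyplot as plt

mobile = [52, 66, 69, 54, 88]
landline = [60, 57, 72, 49, 75]

# Overlaid histograms of the two channels; with more data these begin to
# show how much the two score distributions actually overlap.
plt.hist(mobile, bins=10, range=(0, 100), alpha=0.5, label="Mobile")
plt.hist(landline, bins=10, range=(0, 100), alpha=0.5, label="Landline")
plt.xlabel("Quality score (golden waveform = 100)")
plt.ylabel("Count")
plt.legend()
plt.show()
```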

You might also want to ask for help from a cognitive psychologist who specializes in applied issues. There will be some tradeoff between mean QoS and variance, such that you may be better off minimizing the variance even if doing so lowers the mean. But I'm betting that analysis should have been done by people other than you.

John
  • +1 for good advice. But what's the motivation for referring a question about signal processing/engineering/computation to a "cognitive psychologist"?! – whuber Jul 26 '11 at 14:08
  • @whuber: whatever gets me more insights into the problem is most welcome. the fact that this question finds itself here is thanks in no small measure to an 'army' of do-gooders on SE forums who may be a tad too over-zealous in their classification. – Sriram Jul 26 '11 at 14:13
  • @John: Thanks for the reply. I will go through it and get back. – Sriram Jul 26 '11 at 14:14
  • Whuber, he's attempting to equate QoS of landline and cell phone calls. Likely this is so that people making calls perceive the mobile call to be of equal quality. It may not be possible, or may be prohibitively costly, to get veridical overlap. I'm guessing that there are issues in the perception of signal quality that would be critical for cost-benefit analysis and the differential impacts of things like mean QoS and variance. – John Jul 26 '11 at 14:28
  • @John: Thanks for more insights into the problem! I'll work on what you said. – Sriram Jul 28 '11 at 05:34