
If I want to improve the precision of the results of an assay by averaging a few repeat measurements of each sample, when the random error is generally normally distributed but occasional outliers are possible due to failures in the assay process that can't be identified for exclusion by any means other than the result, which average provides a better estimate, the mean or the median? (Keeping in mind that what matters is which is more representative of the actual concentration of the analyte, not necessarily the best central tendency of the numeric data)

Suppose I test the same blood sample three times with a glucometer. If the results are 160, 21, and 159 mg/dL, obviously the 21 is an outlier caused by some kind of experimental failure, and the median of 159 is a far better estimate of the concentration than the mean of 113. However, what if the results are 149, 130, and 147, or 149, 112, and 147? Both sets of results are within the range of ordinary random error of the glucometer.
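
For concreteness, the averages quoted above can be checked with a couple of lines of R:

# Mean vs. median for the example triplicates
sets = list(c(160, 21, 159), c(149, 130, 147), c(149, 112, 147))
sapply(sets, mean)    # 113.3 142.0 136.0
sapply(sets, median)  # 159 147 147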

My inclination is that in both cases, the median of 147 is the best estimate. The fact that two results are close together and the third lies a multiple of that distance away from either of them suggests that the true concentration is likely to be in the vicinity of the two close results; and that, of those two, the one closer to the more distant result is the better estimate.

However, based on what I've read, for normally distributed data the sample median has a roughly 25.3% higher standard error than the sample mean (the large-sample ratio is $\sqrt{\pi/2} \approx 1.2533$), and while that exact figure may not hold for small sample sizes, it's generally close. So, by using the median, am I sacrificing precision for robustness?
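
A quick simulation sketch like the following (with an arbitrary seed and made-up sample sizes) can show how close the large-sample ratio is for triplicates versus larger samples:

# Ratio of SE(median) to SE(mean) for normal samples of size n
se_ratio = function(n, reps = 10^5) {
  means   = replicate(reps, mean(rnorm(n)))
  medians = replicate(reps, median(rnorm(n)))
  sd(medians) / sd(means)
}
set.seed(1)
se_ratio(3)    # ratio for triplicates (noticeably below 1.2533)
se_ratio(101)  # approaches sqrt(pi/2) ~ 1.2533 as n grows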

For some applications, this may be a desirable trade-off. For example, if I'm using the averages to calibrate a continuous glucose monitor (CGM) from which I intend to draw aggregate data, a few poor calibrations could skew a substantial portion of the data, and the aggregate data will be more precise if I sacrifice a small amount of precision with each calibration to limit the influence of outliers.

However, for statistical analysis, such as calculating the MAE and MAPE of the CGM by periodically comparing it to concurrent averaged meter results from the same subject, would I achieve greater precision using the means of the triplicate glucometer measurements, or is my intuition correct that the median is generally a better estimate of the concentration (at least in cases where there is a substantial difference between median and mean), and that I'm better off using it to average the repeat measurements for all purposes? Would that change if I test each sample five or six times instead of three?
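
For reference, the error metrics I mean would be computed roughly as follows (a sketch; the cgm and ref vectors are hypothetical paired values, with ref being the averaged meter result at each comparison point):

# MAE and MAPE of CGM readings against averaged meter references
cgm = c(148, 131, 165, 120)   # hypothetical CGM readings
ref = c(152, 128, 158, 126)   # hypothetical averaged meter results
mae  = mean(abs(cgm - ref))
mape = 100 * mean(abs(cgm - ref) / ref)
c(MAE = mae, MAPE = mape)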

[As a side note, I do find experimentally that if I test the same sample four times, the fourth result is closer to the median of the first three than to their mean slightly more often than not, and on average the difference between the fourth result and the median is smaller than the difference between it and the mean, in both absolute and relative terms. However, I'm not sure this is statistically significant, as it's not based on a large amount of data and the differences aren't very large.]
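
A simulation along the following lines would be one way to test that side observation more systematically (a sketch only; the true value, error SD, and failure parameters are made up, and the failure mechanism is just one plausible model of an unidentifiable assay error):

# How often is a 4th measurement closer to the median than to the
# mean of the first three, with and without occasional gross failures?
closer_to_median = function(reps = 10^5, fail_rate = 0) {
  hits = replicate(reps, {
    x = rnorm(4, 100, 5)                      # four repeats, normal error
    if (runif(1) < fail_rate)                 # occasionally corrupt one of
      x[sample(1:3, 1)] = rnorm(1, 100, 50)   # the first three measurements
    abs(x[4] - median(x[1:3])) < abs(x[4] - mean(x[1:3]))
  })
  mean(hits)
}
set.seed(1)
closer_to_median(fail_rate = 0)     # purely normal error
closer_to_median(fail_rate = 0.05)  # 5% chance of an assay failure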

Adi Inbar
    If the distribution of the measurements is Laplace (double exponential), then the median is a 'better' estimator of the center of the distribution than is the mean. // Some years ago there was a spirited debate about whether the better model for astronomical measurements is Gaussian (normal) or Laplace, with Gauss and Laplace taking predictable positions on which would be best. – BruceET Nov 18 '19 at 05:59

1 Answer


Let's suppose that data are normally distributed, with no outliers that violate normality.

Then sample means of three observations will have a smaller variance than sample medians of three observations. Moreover, if you tried to 'improve' samples by always removing the observation that is farthest from the sample median, you would get an even larger variance than using sample medians.

For 3 observations that are distributed according to $\mathsf{Norm}(\mu=20, \sigma=2),$ the standard deviation of a sample mean is $SD(\bar X) = 2/\sqrt{3} = 1.155,$ the standard deviation of a sample median is $1.34,$ and the standard deviation of a 'closest 2 of 3' is $1.60.$ Any unwarranted 'outlier' removal gets you closer to the last and largest standard deviation.

Histograms based on a million samples are shown below:

[Figure: histograms of the simulated sampling distributions of the mean, the median, and the 'average of the closest two of three'.]

You must be very cautious about removing what seem to be 'outliers'. For your example with observations 160, 21, and 159, I suppose it is OK to remove 21 as an outlier. Not because it is too far from 159 and 160, but because (from what little I know of glucose measurements) I suppose that is simply not a possible value for a subject who can show up for a routine blood test.

It is OK to assume a data entry error for a basketball player who is listed as 10' 6" tall, or to assume equipment failure for a printout of a Hawaiian temperature of $-60$ degrees F. But you should remove as outliers only values that are truly, obviously wrong, not ones that merely seem a little strange.

R code for the simulated estimates shown in the histograms above:

# Means of three normal observations
set.seed(1117)
a = replicate(10^6, mean(rnorm(3, 20, 2)))
sd(a)
## [1] 1.155163

# Medians of three normal observations
set.seed(1117)
h = replicate(10^6, median(rnorm(3, 20, 2)))
sd(h)
## [1] 1.340838

# Avg of closest two: drop the observation farthest from
# the other two, then average the remaining pair
set.seed(1117)
m = 10^6;  b = numeric(m)
for(i in 1:m) {
 x = sort(rnorm(3, 20, 2))
 b[i] = mean(x[2:3])                     # default: average the upper two
 d = diff(x)
 if(d[2] > d[1]) {b[i] = mean(x[1:2])}   # upper gap larger: drop the top value
}
sd(b)
## [1] 1.598301

par(mfrow=c(1,3))
 hist(a, prob=T, br=25, xlim=c(13,27), col="skyblue2", main="Means")
 hist(h, prob=T, br=25, xlim=c(13,27), col="skyblue2", main="Medians")
 hist(b, prob=T, br=25, xlim=c(13,27), col="skyblue2", 
      main="Avg Closest Two")
par(mfrow=c(1,1))   # reset plotting layout
BruceET
  • You're right that a person with a 21 blood glucose would be symptomatic (probably barely conscious), making it more obvious that the reading was invalid, but I meant just based on the numbers. Maybe a better example is 109, 238, 237 on consecutive tests of the same sample. I'm saying that in a case like that, I think it's a safe bet that the true concentration is much closer to the median (237) than the mean (195), and if additional tests were done they'd be very likely to corroborate that. (Incidentally, that's a real example, where the sample was in fact tested a fourth time, and the result was 242.) – Adi Inbar Nov 20 '19 at 19:46
  • To clarify, in no case was I considering removing the one that's farthest from the mean and averaging the ones that remain, not even in the case of an obvious outlier. My point in discussing cases where two readings are close together and the third is farther by a multiple was to make a case for the median being more representative. Notice that in all the examples I gave, I was talking about taking the median of all three, reasoning that of the two that are close together, the one that’s closer to the “odd one out” is more likely to be closer to the true value. – Adi Inbar Nov 20 '19 at 19:47
  • Even if statistically means have a lower SD, for an assay that can be affected by occasional experimental failures in addition to the purely random error inherent to the process, I strongly suspect that for a sample of 3-5 the median is more representative of the actual concentration, and that if I had the resources to conduct thousands of sets of four repeat measurements of a single sample, the fourth would on average be significantly closer to the median than the mean of the first three. However, as long as that can't be demonstrated experimentally or mathematically, that's intuitive, not scientific. – Adi Inbar Nov 20 '19 at 21:29
  • …That’s why I posted this to medicalsciences rather than here, because I wasn’t sure it’s a purely statistical question, but I suppose this issue lies in the realm of statistics as well. If there are occasional outliers, not in the sense of being at the far reaches of the normal distribution of random measurement error, but due to some failure in the process that can’t be observed but causes these outliers to occur at a significantly higher rate than the normal distribution would predict, wouldn't that make the median a better average, at least potentially? – Adi Inbar Nov 20 '19 at 21:30
  • ...If so, does that mean that the question can only be settled experimentally, or is there a way to determine statistically whether the distributions are distorted enough to prefer the median? – Adi Inbar Nov 20 '19 at 21:32
  • Absent strong evidence of non-normality, I would use the mean (without automatic 'outlier' deletions). Mean is best for many other distributions occurring often in practice. Median is sub-optimal for normal, but maybe not catastrophically bad. Systematically deleting outliers without good cause is the worst choice. – BruceET Nov 20 '19 at 22:05
  • So it sounds like the bottom line is that using the median does sacrifice precision for robustness. The conclusion I'm drawing is that for an outlier-sensitive purpose like calibrating a CGM, the median still makes more sense, if preventing outliers from throwing it way off course for half a day matters more than small differences in precision most of the time or than maximizing the improvement in aggregate precision; but for number crunching such as calculating the MAPE, means, outliers included, would be more precise for anything even close to a normal distribution. – Adi Inbar Nov 20 '19 at 23:32
  • Does that actually mean that given some hypothetical triplicate measurements whose variation is only affected by normally distributed random error, for a result set of 199, 81, 191, the true value is more likely to be closer to the mean (157) than the median (191), and likewise even for 159, 21, 160, it's probably closer to 113 than 159? I understand that if those sets are aggregated with a larger sample of triplicates, most of which would be more typical, calculations based on the means would be more precise, but is that true if you consider just those cases in isolation? – Adi Inbar Nov 20 '19 at 23:33
  • I don't see how to argue successfully from such a particular example. – BruceET Nov 20 '19 at 23:52
  • I'm not sure what you mean. What I'm saying is: if you measure something three times and your results are 159, 21, and 160, is it statistically more probable that the true value is closer to 113 than to 159? And if you measure a fourth time and get 157, is the true value still probably closer to the mean (124) than the median (158)? (Never mind using that data in any other context; just what's the best estimate of the true value based on those results.) That just seems so counterintuitive to me that I want to make sure I'm understanding correctly. – Adi Inbar Nov 21 '19 at 00:46
  • I'm not sure why you want to discuss a sample with 159, 21, 160. I already said it seems 21 may be an obvious mistake. // If the three values are 159, 167, and 160, I think it would be a mistake to disregard 167 as an outlier. My guess would be that the true patient value is nearer to 162 than to 160. Of course, I can't prove that. But you're on weaker ground arguing for 160. – BruceET Nov 21 '19 at 01:39
  • That was meant to be hypothetical, not in reference to the glucometer, in a case where there is no possibility of experimental failure and only normally distributed random error causes variability, in order to make sure I'm understanding a point of statistics that seems very counterintuitive to me. Maybe the fact that I repeated numbers from my post confused the issue, so let me rephrase: – Adi Inbar Nov 21 '19 at 19:12
  • Given triplicate measurements with some hypothetical device whose results are only affected by normally distributed random error, if your results are 38.7, 12.4, 38.1, is it statistically more probable that the true value is closer to the mean, 29.7, than to the median, 38.1? If a fourth measurement is 38.3, would it then be more likely that it's closer to the new mean, 31.9, than to the new median, 38.2? The fact that you have multiple very close measurements and one that appears to be a statistical anomaly (produced purely by chance) wouldn't matter? – Adi Inbar Nov 21 '19 at 19:22
  • Or, in your example, if the middle value were 267 instead of 167, would there still be a stronger case for the mean (195) than the median (160), based purely on how normal distributions work, without regard to the possibility of something going wrong in the measurement process? – Adi Inbar Nov 21 '19 at 19:28
  • I'm being prompted to move this to chat. I think I'll just post a new question. Although this does pertain to what I was driving at in this question, maybe it makes more sense to split this specific aspect of it into a new question. – Adi Inbar Nov 21 '19 at 19:39
  • Never mind, no need to ask a new question. Playing around with a probability-from-z-score calculator for a while, I can see that the combined probability of any three values, for any given population SD, is lower if you set the population mean to their median than to their mean, which indicates that the set of measurements is less likely to occur if the true value is the median than if it's the mean. In fact, the more anomalous the outlier, if it's produced purely by normally distributed random error, the lower the probability is for the median than the mean. Thanks for all your help! – Adi Inbar Nov 23 '19 at 05:38
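
Along the lines of the contamination scenario raised in the comments above, a small simulation can show when the median of a triplicate starts to beat the mean as an estimate of the true value (a sketch; the failure rate and failure spread are made-up parameters, not properties of any particular assay):

# RMSE of the mean vs. the median of a triplicate when each measurement
# has some probability of being corrupted by an assay failure
rmse_compare = function(fail_rate, reps = 10^5, true = 100, sd_ok = 5, sd_fail = 50) {
  est = replicate(reps, {
    x = rnorm(3, true, sd_ok)               # three ordinary measurements
    bad = runif(3) < fail_rate              # which, if any, failed
    x[bad] = rnorm(sum(bad), true, sd_fail) # failures have much larger error
    c(mean = mean(x), median = median(x))
  })
  sqrt(rowMeans((est - true)^2))            # RMSE of each estimator
}
set.seed(1)
rmse_compare(fail_rate = 0)     # no failures: the mean has lower RMSE
rmse_compare(fail_rate = 0.05)  # occasional failures: the median can win

Which estimator wins, and by how much, depends on the failure rate and on how wild the failures are, which is consistent with BruceET's point that the choice hinges on how far the error distribution departs from normal.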