20

Is the following formula right if I want to measure the standard error of the median in the case of a small sample with a non-normal distribution (I'm using Python)?

import numpy as np

sigma = np.std(data)                        # standard deviation of the sample
n = len(data)                               # sample size
sigma_median = 1.253 * sigma / np.sqrt(n)   # estimated standard error of the median
mary
  • As of this date, more than nine years later, a fully correct answer has not been posted: all of them, although useful (and +1 to many of them), implicitly assume your "non-normal distribution" is continuous in a neighborhood of its median. To appreciate the problem, consider what the SE of the sample median would be for a Bernoulli variable. For a collection of more careful analyses see the (earlier) thread at https://stats.stackexchange.com/questions/45124/. – whuber Aug 25 '22 at 18:49

5 Answers

25

The magic number 1.253 comes from the asymptotic variance formula: $$ {\rm As. Var.}[\hat m] = \frac1{4f(m)^2 n} $$ where $m$ is the true median, and $f(m)$ is the true density at that point. The number 1.253 is $\sqrt{\pi/2}$, which comes from the normal distribution, so... you are still assuming normality with that.
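To spell out where $\sqrt{\pi/2}$ comes from (filling in the arithmetic): for a normal distribution the density at the median, which is also the mean, is $f(m) = \frac{1}{\sigma\sqrt{2\pi}}$, so
$$ {\rm As. Var.}[\hat m] = \frac{1}{4f(m)^2 n} = \frac{2\pi\sigma^2}{4n} = \frac{\pi}{2}\cdot\frac{\sigma^2}{n}, \qquad {\rm s.e.}[\hat m] \approx \sqrt{\frac{\pi}{2}}\,\frac{\sigma}{\sqrt n} \approx 1.253\,\frac{\sigma}{\sqrt n}. $$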

For any distribution other than the normal (and mary admits that this is doubtful in her data), you would have a different factor. If you had a Laplace/double exponential distribution, the density at the median is $1/(2b)$ and the variance is $2b^2$, so the factor should be $1/\sqrt{2} \approx 0.707$ -- the median is the maximum likelihood estimate of the shift parameter, and is more efficient than the mean. So you can start picking your magic numbers in different ways...
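As a check on that factor (same asymptotic formula, my own arithmetic): for a Laplace distribution with scale $b$, $f(m) = \frac{1}{2b}$ and $\sigma^2 = 2b^2$, so
$$ {\rm As. Var.}[\hat m] = \frac{1}{4 f(m)^2 n} = \frac{b^2}{n} = \frac{\sigma^2}{2n}, \qquad {\rm s.e.}[\hat m] \approx \frac{1}{\sqrt 2}\,\frac{\sigma}{\sqrt n} \approx 0.707\,\frac{\sigma}{\sqrt n}. $$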

Getting the median estimate $\hat m$ is not such a big deal, although you can start agonizing over averaging the two middle values for an even number of observations vs. inverting the empirical cdf, or something like that. More importantly, the relevant density value $f(\hat m)$ can be estimated with a kernel density estimator, if needed (a sketch of this appears after the list below). Overall, this is of course relatively dubious, as three approximations are being taken:

  1. That the asymptotic formula for variance works for the small sample;
  2. That the estimated median is close enough to the true median;
  3. That the kernel density estimator gives an accurate value.

The smaller the sample size, the more dubious it gets.
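A minimal sketch of this approach (my own illustration, not part of the original answer; it assumes scipy is available and uses a Gaussian KDE with default bandwidth), with all three caveats above applying:

import numpy as np
from scipy.stats import gaussian_kde

def se_median_kde(data):
    """Asymptotic SE of the median: 1 / (2 * f_hat(m_hat) * sqrt(n))."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    m_hat = np.median(data)                  # sample median
    f_hat = gaussian_kde(data)(m_hat)[0]     # estimated density at the sample median
    return 1.0 / (2.0 * f_hat * np.sqrt(n))

# example with randomly generated skewed data
rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=30)
print(se_median_kde(data))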

StasK
  • Perhaps worth adding that the magic number is $\sqrt{\dfrac{\pi}{2}} \approx 1.253314$ – Henry Feb 25 '18 at 23:27
20

Based on some of @mary's comments I think the following is appropriate. She seems to be selecting the median because the sample is small.

If you were selecting the median because the sample is small, that's not a good justification. You select the median because the median is an important value: it says something different from the mean. You might also select it for some statistical calculations because it's robust against certain problems, like outliers or skew. However, small sample size isn't one of those problems it's robust against. For example, as the sample size gets smaller the median actually becomes much more sensitive to skew than the mean; a quick simulation below illustrates this.
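A minimal sketch of that last point (my own illustration, using exponential data as a convenient skewed example): the sample mean stays unbiased at any n, while the bias of the sample median grows as n shrinks.

import numpy as np

rng = np.random.default_rng(0)
true_median = np.log(2)      # median of Exp(1)
true_mean = 1.0              # mean of Exp(1)

for n in (5, 50):
    samples = rng.exponential(scale=1.0, size=(100_000, n))
    median_bias = np.median(samples, axis=1).mean() - true_median
    mean_bias = samples.mean(axis=1).mean() - true_mean
    print(f'n={n:2d}  median bias: {median_bias:+.3f}  mean bias: {mean_bias:+.3f}')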

John
  • Thanks John! Actually I chose to use the median in place of the mean for the reason you have just written. I have different samples, all of them with non-Gaussian distributions. Some samples contain more than 50 points, others contain fewer than 10 points, but I think your comment is valid for all of them, isn't it? – mary May 23 '13 at 16:17
  • With so few points I'm not sure what you can say about the underlying distribution. If you're comparing samples containing fewer than 10 points with samples containing 50, and the underlying distribution isn't symmetric, a median will show an effect even if there isn't one, because it will have more bias in the small sample than in the large one. The mean won't. – John May 23 '13 at 18:54
  • In the future flesh out your questions better and ask more about what you really need to know. Say why you've done what you've done so far and describe the data that you have well. You'll get much better answers. – John May 23 '13 at 18:55
  • "small sample size isn't one of those problems it's robust against" is worth a +1 on its own; the rest is a bonus – Glen_b May 24 '13 at 03:59
  • As a matter of fact, Huber makes a point in his book that there is no single concept of robustness. There is robustness to outliers (and that's what the median is robust for). Another view, however, is robustness to the measurement error, and that's what the mean is robust for, as it averages these measurement errors. The median, however, is highly susceptible to measurement error fluctuations as they may affect the middle of the distribution just as badly as the tails. – StasK Jun 14 '13 at 15:08
  • If the median is more robust, why is its SE larger than the mean's? – Albert Chen Oct 19 '20 at 19:20
17

Sokal and Rohlf give this formula in their book Biometry (page 139). Under "Comments on applicability" they write: "Large samples from normal populations." Thus, I am afraid that the answer to your question is no. See also here.

One way to obtain the standard error and confidence intervals for the median in small samples with non-normal distributions would be bootstrapping. This post provides links to Python packages for bootstrapping.
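For illustration, here is a minimal bootstrap sketch in plain numpy (my own example, not taken from the linked packages; the data values are made up). The warning below about small samples applies to it as well.

import numpy as np

def bootstrap_se_median(data, n_boot=10_000, seed=0):
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(data), size=(n_boot, len(data)))  # resample indices with replacement
    medians = np.median(data[idx], axis=1)                      # median of each bootstrap sample
    return medians.std(ddof=1)                                  # spread of the bootstrap medians

data = [2.1, 3.4, 3.9, 4.2, 5.0, 6.3, 7.1, 9.8]   # hypothetical small sample
print(bootstrap_se_median(data))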

Warning

@whuber pointed out that bootstrapping the median in small samples isn't very informative as the justifications of the bootstrap are asymptotic (see comments below).

COOLSerdash
  • Thanks for your answer! I know that bootstrapping would be an alternative; I was just wondering whether there is a way to measure the error of the median differently. Is the answer also no for the standard error of the MEAN (same small non-Gaussian sample)? – mary May 23 '13 at 14:58
  • @mary For the standard error of the mean, Sokal and Rohlf write that it is applicable for "[...] any population with finite variance." So the answer for the standard error of the mean seems to be yes, you can calculate it. Side note: there are distributions, though (e.g. the Cauchy distribution), that don't have a defined variance or mean, and in such cases the SEM cannot be calculated. – COOLSerdash May 23 '13 at 15:00
  • (+1) Unfortunately, bootstrapping the median of a small sample won't be very informative, either--and is unnecessary, because it can be replaced by a simple calculation. (For any number $t$, ask yourself, what are the chances that more than half of a bootstrap sample will exceed $t$? That answer is easy to come by, and now you don't need to run any simulations to estimate it.) – whuber May 23 '13 at 15:04
  • @whuber Thanks for your comment. That's good to know. I deleted the advice to bootstrap the median in small samples from my answer. – COOLSerdash May 23 '13 at 15:06
  • I wasn't trying to suggest it's bad advice: I only wanted to point out its (unavoidable) limitations. Learning much from small samples is hard. But bootstrapping small samples is doubly fraught, because there is no theoretical justification to support it (all the justification is asymptotic). – whuber May 23 '13 at 15:10
  • @whuber So what is the best thing to do in case of small samples? Is it better to measure the mean and the standard error of it instead of the median? – mary May 23 '13 at 15:16
  • @mary Let's say you took a survey of 100 people in a small city to find their salaries, and 99 have salaries of $20k per yr or less and one makes $50m per yr. If trying to determine what salary the common man in town has so that you could determine whether to market the $1 coffees or $5 lattes, you'd likely use median. If trying to determine how much the average person makes to multiply by the known population to estimate tax revenue, you might use mean. If your sample size is too small, the difference in std error between each may not matter, as the result of either choice may mislead. – Gary S. Weaver May 23 '13 at 17:51
0

Not a solution here, but perhaps helpful:

Suppose your data distribution is $p(x)$, and let $P(x) = \int_{-\infty}^x p$ be the cumulative distribution function. The median of the distribution is then the number $m$ such that $P(m) = 1/2$.

Following this helpful page, we can compute the density of the sample median of $n$ (odd) draws from this distribution. It is $q(x) = c_n\, p(x) \left(P(x)(1-P(x))\right)^{(n-1)/2}$. Here $c_n$ is the normalizing constant that makes this a probability density; by the standard order-statistic formula it is $c_n = \frac{n!}{\left(\frac{n-1}{2}\right)!^{\,2}} = n\binom{n-1}{(n-1)/2}$ for odd $n$.

Finally, you would like to know the variance of $q(x)$, which you may be able to work out with this formula, either analytically or numerically.
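A numerical sketch of this (my own illustration, not part of the original answer), taking $p$ to be the standard normal density so the result can be compared with the asymptotic $\sqrt{\pi/2}\,\sigma/\sqrt{n}$ factor discussed above:

from math import factorial, sqrt, pi
import numpy as np
from scipy import integrate
from scipy.stats import norm

n = 9                                             # odd sample size
c_n = factorial(n) / factorial((n - 1) // 2)**2   # normalizing constant for odd n

def q(x):
    F = norm.cdf(x)
    return c_n * norm.pdf(x) * (F * (1 - F))**((n - 1) / 2)

m1, _ = integrate.quad(lambda x: x * q(x), -np.inf, np.inf)
m2, _ = integrate.quad(lambda x: x**2 * q(x), -np.inf, np.inf)
print('exact SE of the median: ', sqrt(m2 - m1**2))
print('asymptotic approximation:', sqrt(pi / 2) / sqrt(n))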

travelingbones
0

There is an empirical procedure for obtaining a confidence interval for the sample median. The procedure is non-parametric and relies on the binomial distribution. It can be found in Ott and Longnecker, 2015 in the section named ‘Inferences about the median’. Stata implements the procedure as the ‘centiles’ command and the Stata doc provides a mathematical justification with references.

Here is the procedure for a 95% CI, implemented as a Python script. The standard error of the median is determined from the CI. The results are the same as the results from the Stata 'centiles' command.

import numpy as np
from scipy.stats import binom
# n = 25
data = [1.1, 1.2, 2.1, 2.6, 2.7, 2.9, 3.6, 3.9, 4.2, 4.3, 4.5, 4.7, 5.3,
        5.6, 5.8, 6.5, 6.7, 6.7, 7.8, 7.8, 14.2, 25.9, 29.5, 34.8, 43.8]
median = np.median(data)
print(f'median: {median}')
# the distribution
n = 25
p = .5
rv = binom(n, p)
# 95% critical value
q = .05
binom_critical = rv.ppf(q=q)
print(f'binom 95% critical value: {binom_critical}')
# the 95% CI for the median
L_q = int(binom_critical)
U_q = int(n - binom_critical)
print(f'L_q: {L_q} U_q: {U_q}')
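# note: `data` must already be sorted (it is, above), so that indexing by rank
# returns the order statistics that bound the CI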
lower_ci = data[L_q - 1]
upper_ci = data[U_q - 1]
print(f'lower_ci: {lower_ci} upper_ci: {upper_ci}')
Output:

median: 5.3
binom 95% critical value: 8.0
L_q: 8 U_q: 17
lower_ci: 3.9 upper_ci: 6.7

The standard error is taken to be half the CI width: in this case (6.7 - 3.9) / 2 = 1.4. This is analogous to the normal case, where, given the CI for the mean, the standard error is calculated as:

$$ {\rm SE} = \frac{{\rm CI}_{\rm upper} - {\rm CI}_{\rm lower}}{2\,t} $$

For the non-parametric method for the median, there is no t-statistic because the confidence level is embodied in the critical values for the binomial distribution.
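In code, continuing the script above, the approximation described here is simply:

se_median = (upper_ci - lower_ci) / 2   # half the CI width
print(f'approximate SE of median: {se_median:.2f}')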

tmck
  • The question is about the standard error of the median, not confidence intervals. If you could edit it to address that, this would be a useful answer. – mkt Sep 11 '22 at 12:26
  • This looks like the procedure described in my post at https://stats.stackexchange.com/a/284970/919. There might be ad hoc ways to convert it into some kind of equivalent standard error, but that would be a rough approximation for any small sample. – whuber Sep 17 '22 at 12:53