7

Suppose we have the median incomes of 150 different countries (in each country, the median income was calculated on a sample of people) - we don't have the individual incomes of each person in these countries, but rather, we are only presented with the median income for each country. However, we have the total population of each country and how many samples were used to calculate the median income of each country.

In this situation, suppose I am interested in finding out the variability of the median incomes. The first thing that comes to mind is to use the Bootstrap Method. For example:

  • Take a random sample (with replacement) of size=150 from the original data
  • Calculate the median income from this random sample
  • Repeat the above two steps many times - now you will have an empirical distribution of sample medians
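
A minimal sketch of the three steps above. The post does not show the real data, so `country_medians` is a synthetic, hypothetical stand-in, and the number of replicates `n_boot` is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 150 country-level median incomes
# (synthetic lognormal values; not the data from the question).
country_medians = rng.lognormal(mean=9.0, sigma=1.0, size=150)

n = country_medians.size
n_boot = 5000  # number of bootstrap replicates (arbitrary choice)

# Steps 1 and 2, repeated n_boot times: draw n indices with replacement,
# then take the median of each resample (one resample per row).
idx = rng.integers(0, n, size=(n_boot, n))
boot_medians = np.median(country_medians[idx], axis=1)

# Step 3: boot_medians is the empirical distribution of resampled medians.
print(boot_medians.shape)
```

Note that this resamples countries as if they were exchangeable draws; it ignores the differing sample sizes and populations behind each country's median.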

My Question: I am confused about what to do from this point. From here, you can take either the mean of all these medians or the median of all these medians. You can also use percentiles to place bounds on your estimates (e.g. a 95% confidence interval on the median of medians, or a 95% confidence interval on the mean of medians). Are both of these approaches mathematically valid?

As I understand it, the Bootstrap Method uses the Law of Large Numbers to show that the average result of finite experiments "converges in probability" to the average of the actual phenomenon that these experiments are studying. The Law of Large Numbers does not make any similar claims about the median.

So in this case, it appears that the above Bootstrap procedure I described will be calculating and placing bounds on the "Average Median" - and not on the Median of the Medians. I do not think it would be correct to take the median of all bootstrapped medians and then calculate the Confidence Interval around the "Median of the Medians". I think it would make more sense to take the mean of all bootstrapped medians and calculate the Confidence Interval around the "Mean of the Medians".

Is my understanding correct?

Note: In such a situation, we are limited in the conclusions that we can draw from this data. Using the bootstrap method, we cannot draw any reliable conclusions about the median income of individual people in these countries - nor can we draw conclusions about the individual countries (i.e. ecological fallacy). Rather, we can only draw conclusions about the overall median income of all countries.

stats_noob
  • Bootstrapping the median is merely a computationally expensive method of approximating the exact results given at https://stats.stackexchange.com/questions/122001. – whuber Aug 08 '23 at 16:21
  • @whuber: thank you for your reply! If I had the raw data, there would be no need to bootstrap the median because the exact formula is provided. However, in my example, I am only provided with a set of medians (i.e. median income from different countries) without the raw data. This is why I was looking into bootstrapping. – stats_noob Aug 08 '23 at 16:26
  • Usually when one "has" a median it has been estimated from a dataset. But I will grant that someone else might have collected the data and not disclosed the details. Nevertheless, it is hard to see how one could consider analyzing any set of medians without making some strong assumptions and having, at a minimum, information about the sizes of the datasets. Also, if you really are interested only in the "variability of the median incomes," what's the matter with simply quantifying that? What do you hope a bootstrapping procedure would reveal? How would you interpret its results? – whuber Aug 08 '23 at 16:47
  • @whuber: you are correct - only the median incomes have been provided to me, along with the sample sizes and population size used to calculate each median. I want to find out the variance in the medians - but I am not sure how well the median of medians estimator is documented (e.g. weighted median). This is why I wanted to bootstrap. – stats_noob Aug 08 '23 at 23:26
  • Suppose you could somehow acquire the standard deviation of each sample median. Could you somehow construct something similar to an "inverse variance weighted median" estimator? – stats_noob Aug 08 '23 at 23:29
  • @Whuber: I have posted a follow up question about this: https://stats.stackexchange.com/questions/623548/does-the-weighted-median-exist – stats_noob Aug 09 '23 at 00:22
  • @whuber Regarding your result linked in the first comment, wouldn't it be conceivable that the bootstrap can improve upon the nonparametric result in case one doesn't want to make a parametric assumption? – Christian Hennig Aug 09 '23 at 00:33
  • @Christian How, exactly? It's difficult even to see how one might even begin bootstrapping, given that these medians all have different sampling distributions. – whuber Aug 09 '23 at 13:50
  • @whuber I was referring to the general problem of obtaining a CI for the median as discussed in that question. My question wasn't about the country-wise medians here. – Christian Hennig Aug 10 '23 at 02:04
  • @Christian Okay. But I believe the bootstrap of a simple random sample (with replacement) will merely recover the nonparametric CI. – whuber Aug 10 '23 at 13:21
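
The linked thread is not reproduced here. Assuming the "nonparametric CI" mentioned in these comments is the standard distribution-free interval built from order statistics, a rough sketch for a simple random sample looks like this. The function name `median_ci_order_stats` and the synthetic data are hypothetical, and, as the comments point out, the country-wise medians do not obviously satisfy the i.i.d. assumption:

```python
from math import comb
import numpy as np

def median_ci_order_stats(x, level=0.95):
    """Distribution-free CI for the population median of an i.i.d. sample."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    alpha = 1.0 - level
    # Find the largest l with P(B <= l - 1) <= alpha / 2, B ~ Binomial(n, 1/2).
    cdf, l, tail = 0.0, 0, 0.0
    for k in range(n + 1):
        cdf += comb(n, k) / 2**n
        if cdf > alpha / 2:
            break
        l, tail = k + 1, cdf
    if l == 0:
        raise ValueError("sample too small for the requested level")
    # Interval [x_(l), x_(n-l+1)] in 1-based order statistics; for a
    # continuous distribution its coverage is 1 - 2 * tail.
    return x[l - 1], x[n - l], 1.0 - 2.0 * tail

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=9.0, sigma=1.0, size=150)  # synthetic i.i.d. data
lo, hi, coverage = median_ci_order_stats(sample)
print(lo, hi, coverage)
```

No resampling is involved: the interval endpoints are two order statistics chosen from the Binomial(n, 1/2) distribution.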

2 Answers

10

By sampling from the data, bootstrap tries to mimic how you sampled the data from the population. By repeating the procedure, you simulate the distribution of the different possible samples you could have taken from the population. Just as you can take the actual data $\mathbf{x}$ and calculate some statistic or estimator $T$ from it, $\hat \theta = T(\mathbf{x})$, you can do the same with a bootstrap sample $\mathbf{x}^*$ to get $\hat \theta^* = T(\mathbf{x}^*)$. So yes, it can be applied to nearly anything.† As a result, you obtain a sample from the distribution of the estimates $\boldsymbol{\hat\theta}^* = \big\{ \hat\theta^*_1, \hat\theta^*_2, \dots, \hat\theta^*_R \big\}$. Now, you can use different statistics (mean, median, standard deviation, quantiles, histogram, etc.) to summarize or visualize the distribution.

Usually, you would use bootstrap to estimate the uncertainty in the estimate $\hat \theta$ that is due to sampling, so you would calculate things like quantiles or the variance. The mean or median of the bootstrap distribution would not be very interesting, because the distribution would be centered approximately at the original estimate $\hat\theta$, only perturbed by random resampling. The confidence interval is around the original estimate $\hat\theta$, not around the mean or median of the bootstrap distribution.

> So in this case, it appears that the above Bootstrap procedure I described will be calculating and placing bounds on the "Average Median" - and not on the Median of the Medians. I do not think it would be correct to take the median of all bootstrapped medians and then calculate the Confidence Interval around the "Median of the Medians".

What you would usually do is take quantiles of the bootstrap distribution, i.e. the 95% confidence interval around the median calculated on the raw data $\hat\theta$ would be $\big[Q_{0.025}(\boldsymbol{\hat\theta}^*),\, Q_{0.975}(\boldsymbol{\hat\theta}^*)\big]$, where $Q_{q}$ is the $q$-th empirical quantile of the bootstrap distribution.
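
As a hedged illustration of the percentile interval just described, here is a small sketch. The helper name `percentile_ci`, the number of replicates, and the synthetic data are assumptions made for the example:

```python
import numpy as np

def percentile_ci(x, stat=np.median, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for stat(x), treating x as an i.i.d. sample."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Each row of idx selects one resample of the same size as x.
    idx = rng.integers(0, x.size, size=(n_boot, x.size))
    boot = np.apply_along_axis(stat, 1, x[idx])
    alpha = 1.0 - level
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    # Return the original estimate together with the interval around it.
    return stat(x), lo, hi

# Hypothetical data standing in for the observed values.
data = np.random.default_rng(42).lognormal(mean=9.0, sigma=1.0, size=150)
theta_hat, lo, hi = percentile_ci(data)
print(f"estimate = {theta_hat:.1f}, 95% CI = [{lo:.1f}, {hi:.1f}]")
```

The interval is reported around `theta_hat`, the statistic computed on the original data, rather than around the mean or median of the bootstrap replicates.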

As @Silverfish noticed below, however, bootstrap would not help you find the “variability of medians”; it estimates the variability of some statistic due to sampling. If your data are medians and you want to know how variable they are, just calculate the variance of the medians.
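
To make the distinction concrete, the sketch below computes both quantities on synthetic, hypothetical data: the spread of the medians themselves (no bootstrap needed) and the bootstrap standard error of one summary of them. The variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in for the observed country medians.
country_medians = rng.lognormal(mean=9.0, sigma=1.0, size=150)

# Variability of the medians themselves: plain descriptive statistics.
variance = country_medians.var(ddof=1)
q25, q75 = np.quantile(country_medians, [0.25, 0.75])
iqr = q75 - q25

# Uncertainty of a summary statistic (here the median of the medians):
# bootstrap standard error, i.e. the spread of the resampled statistic.
n = country_medians.size
idx = rng.integers(0, n, size=(5000, n))
boot_se = np.median(country_medians[idx], axis=1).std(ddof=1)

print(f"SD of medians: {variance ** 0.5:.1f}, IQR: {iqr:.1f}")
print(f"bootstrap SE of the median of medians: {boot_se:.1f}")
```

The first two numbers describe how much the countries differ from one another; the last describes how much the overall summary would change under resampling of countries.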

† - see comment by @whuber below.

Tim
  • (+1) One more point in the original post which I think is worth addressing: "suppose I am interested in finding out the variability of the median incomes". It seems to me the OP would benefit from a brief summary of what you'd do if you really were interested in "variability" in the median rather than "uncertainty" in the median, and/or a clarification that the bootstrap 95% CI for the median is to do with uncertainty rather than variability – Silverfish Aug 08 '23 at 15:36
  • Bootstrapping cannot validly "be applied to anything." It has known limitations that are described in the literature (as in Efron's work, for instance). In particular, it applies only to statistics that are functionals of a distribution and, in most settings, is useless for studying extremes of a distribution. Its justification also is asymptotic, so caution is required when applying it to "small" datasets. – whuber Aug 08 '23 at 16:24
5

I don't see that this has already been said plainly, so I will say it. In the situation described, you cannot use the bootstrap to say anything about the variability of the medians of the individual countries, because this variability is governed by the variability of observations within a country, which you don't have.

Any bootstrap you can do will be informative only about the variability of the statistic you bootstrap over all countries. So if you bootstrap the median, it will tell you about the variability of the estimator of the overall median, not of the within-country medians in which you are apparently interested. (You may already understand this; I can't tell from what you wrote, as there seems to be some confusion about the fact that the given data are medians and one could also be interested in the median of the data, i.e., the median of the medians.)

Note that if you are indeed interested in the variability of an overall statistic of the observations (which are medians, but this isn't really relevant to the bootstrap here), you can choose whether you want to consider the mean or the median. From a statistical point of view, it can't be said that one of these is correct and the other wrong (although, knowing the data and background, there could be arguments in either direction). @whuber's comment still holds: there may be analytical results that don't require the bootstrap. It can also be doubted that the countries constitute a valid random sample from a well-defined population.
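
As a rough sketch of that choice, the same set of resamples can be used to bootstrap either summary. The data here are synthetic and hypothetical, and resampling countries as if they were a random sample is exactly the assumption questioned above:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical stand-in for the observed country medians.
country_medians = rng.lognormal(mean=9.0, sigma=1.0, size=150)

n = country_medians.size
resamples = country_medians[rng.integers(0, n, size=(5000, n))]

results = {}
for name, stat in {"mean": np.mean, "median": np.median}.items():
    boot = stat(resamples, axis=1)  # statistic on each resample (row)
    lo, hi = np.quantile(boot, [0.025, 0.975])
    results[name] = (stat(country_medians), lo, hi)
    print(f"{name}: estimate = {results[name][0]:.1f}, "
          f"95% percentile CI = [{lo:.1f}, {hi:.1f}]")
```

Both intervals are valid answers to different questions: one concerns the overall mean of the country medians, the other their overall median.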

  • Thank you for your answer! In this question I posted, the overall median is the same as the median of the medians. Bootstrapping can only tell you the variability of the overall median (i.e. the median of the median incomes of all countries) .... and NOT about individual countries or people in the individual countries. Is what I have written correct? Thank you so much! – stats_noob Aug 09 '23 at 04:03
  • Yes that's how it is. – Christian Hennig Aug 10 '23 at 02:02
  • thank you so much for your clarifications. I am working on a question here where I think the weighted bootstrap might be applicable. Can you please take a look at it here? https://stats.stackexchange.com/questions/623548/does-the-weighted-median-exist thank you so much! – stats_noob Aug 10 '23 at 13:26