
The Wikipedia page about quantiles describes two approaches to the definition of quantiles: population quantiles, and sample quantiles. The section on sample quantiles lists nine different flavors of sample quantiles together with the formulas used to calculate them.

Are the various sample quantile formulas consistent with the definition of population quantiles?

I'll explain what I mean by that. According to the definition of population quantiles, given a distribution and a number $0 < p < 1$ there may be an entire interval $I \subseteq \mathbb{R}$ each of whose members is a $p$-quantile of the distribution.

For example, consider the finite uniform distribution on the set $\{1,2,3,4\}$, which assigns each of these four numbers the probability $\frac{1}{4}$. Call this distribution $P$. Then any number in the closed interval $[2,3]$ is a $\frac{1}{2}$-quantile (aka a median) of this distribution, per the definition of population quantiles, because the numbers $x \in [2,3]$ are precisely those numbers satisfying

$$ P\big((-\infty,x)\big)\leq \frac{1}{2}, $$

$$ P\big((-\infty,x]\big)\geq \frac{1}{2}. $$

My question is, then, whether each of the nine numbers calculated as $p$-sample quantiles always belongs to the interval of $p$-population quantiles.

In other words, does each of the nine sample quantile formulas return a value from the corresponding population quantile interval?
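One way to probe this numerically is sketched below. It rests on several assumptions that are not in the text above: the sample is treated as its own population (each observation gets probability $1/N$), and the nine formulas are transcribed in the Hyndman-Fan parametrisation ($j = \lfloor Np + m \rfloor$, $g = Np + m - j$) as I understand it from the R `quantile` documentation. The names `sample_quantile` and `in_quantile_interval` are hypothetical helpers, and `p` must be passed as a `Fraction`.

```python
from fractions import Fraction
from math import floor

# Constant m for the interpolating types 4-9 (assumed transcription).
M_CONST = {
    4: lambda p: Fraction(0),
    5: lambda p: Fraction(1, 2),
    6: lambda p: p,
    7: lambda p: 1 - p,
    8: lambda p: (p + 1) / 3,
    9: lambda p: p / 4 + Fraction(3, 8),
}

def sample_quantile(xs, p, kind):
    """Sample p-quantile of type `kind` (1..9), computed in exact arithmetic."""
    xs = sorted(xs)
    n = len(xs)
    if kind in (1, 2):
        m = Fraction(0)
    elif kind == 3:
        m = Fraction(-1, 2)
    else:
        m = M_CONST[kind](p)
    j = floor(n * p + m)
    g = n * p + m - j
    if kind == 1:
        gamma = 0 if g == 0 else 1
    elif kind == 2:
        gamma = Fraction(1, 2) if g == 0 else 1
    elif kind == 3:
        gamma = 0 if (g == 0 and j % 2 == 0) else 1
    else:
        gamma = g
    # Order statistics are 1-indexed; clamp indices to the sample range.
    lo = xs[min(max(j, 1), n) - 1]
    hi = xs[min(max(j + 1, 1), n) - 1]
    return (1 - gamma) * lo + gamma * hi

def in_quantile_interval(x, p, xs):
    """Check P(X < x) <= p <= P(X <= x) for the empirical distribution of xs."""
    n = len(xs)
    below = Fraction(sum(v < x for v in xs), n)
    at_or_below = Fraction(sum(v <= x for v in xs), n)
    return below <= p <= at_or_below

data = [1, 2, 3, 4]
for p in (Fraction(1, 2), Fraction(3, 10)):
    inside = [k for k in range(1, 10)
              if in_quantile_interval(sample_quantile(data, p, k), p, data)]
    print(f"p = {p}: types inside the population quantile interval: {inside}")
```

Under these assumptions, and for this toy sample only, all nine types land inside $[2,3]$ for $p = \frac{1}{2}$, while for $p = \frac{3}{10}$ (where the population quantile is the single point $2$) only types 1 and 2 do. This is an illustration on one sample, not a proof about the formulas in general.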

* I'm mostly interested in distributions with finite support (such as the one in the example above).

User1865345
  • 8,202
Evan Aad
  • 1,433
  • In the case of continuous distributions, the interval you refer to is simply a number. So you have to check whether all the sample versions give this number, which they surely do not! – kjetil b halvorsen Jan 22 '24 at 23:39
  • @kjetilbhalvorsen How about a distribution with finite support? – Evan Aad Jan 23 '24 at 00:53
  • They are all intended to be consistent in that sense. (@kjetilbhalvorsen, the formulas are not meaningful for continuous distributions where the $N$ and $h$ would be infinite.) – Matt F. Jan 23 '24 at 22:14

1 Answer


The key to answering your question is in this part of the article:

The asymptotic distribution of the $p$-th sample quantile is well-known: it is asymptotically normal around the $p$-th population quantile with variance equal to:

$$ \frac{p(1-p)}{N f(x_p)^2} $$

Now what does this mean? First, consider an experiment. Given a probability distribution $X$ that we can draw samples from, a fixed real number $0 < p < 1$, and an integer $N$, we draw $N$ samples from $X$, sort them, and take the $Np$-th sorted sample as the $p$-th sample quantile. For simplicity, if the result is an interval, meaning that $Np \notin \mathbb{N}$ and the $\lfloor Np \rfloor$-th sample is not equal to the $\lceil Np \rceil$-th sample, we take the middle of the interval, i.e., the average of the two samples. We record this value as the result of the experiment.

What the quote from the article states is this. Suppose we know the $p$-th quantile of $X$ (again choosing the middle of the interval, if needed); call it $x_p$, the "true" or population $p$-th quantile. If we choose a large enough $N$ (hence the term "asymptotic") and perform the above experiment many times, the results will be approximately normally distributed around $x_p$ with the variance above, provided that $x_p$ is a possible output of the distribution ($f(x_p) \neq 0$).
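The experiment can be simulated directly. The sketch below assumes a standard normal distribution and $p = \frac{1}{2}$, so that $x_p = 0$ and $f(x_p) = 1/\sqrt{2\pi}$; the sample size, number of repetitions, and seed are arbitrary choices, and the function names are hypothetical.

```python
import math
import random
import statistics

def run_experiment(rng, n, p):
    """Draw n standard-normal samples and return the sample p-quantile,
    averaging the floor(Np)-th and ceil(Np)-th sorted samples."""
    xs = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
    h = n * p
    lo = min(max(math.floor(h), 1), n)  # 1-indexed, clamped to [1, n]
    hi = min(max(math.ceil(h), 1), n)
    return (xs[lo - 1] + xs[hi - 1]) / 2

def simulate_median(n=401, trials=2000, seed=0):
    """Repeat the experiment and compare the spread with p(1-p) / (N f(x_p)^2)."""
    rng = random.Random(seed)
    p = 0.5
    results = [run_experiment(rng, n, p) for _ in range(trials)]
    f_xp = 1.0 / math.sqrt(2.0 * math.pi)  # standard normal density at x_p = 0
    predicted = p * (1 - p) / (n * f_xp ** 2)
    return statistics.mean(results), statistics.variance(results), predicted

mean, observed, predicted = simulate_median()
print(f"mean of results:    {mean:.4f}")
print(f"observed variance:  {observed:.6f}")
print(f"predicted variance: {predicted:.6f}")
```

With a continuous distribution like this one, the observed and predicted variances should be of similar size, and both shrink as `n` grows.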

This means that there are three possible cases:

  1. If the population $p$-th quantile $x_p$ has finite, non-zero probability density in the underlying distribution of the population ($0 < f(x_p) < \infty$), the results of the experiment will get closer to $x_p$ as we increase the sample size.
  2. If $f(x_p)$ is infinite, then the variance given by the formula is identically zero for a large enough $N$! This means that almost all experiments will result in the correct estimation of $x_p$, e.g., for the discrete uniform distribution over $\{ 1, 2, 3 \}$.
  3. If $f(x_p) = 0$, however, as in your example with the discrete distribution over $\{ 1, 2, 3, 4 \}$, the statement offers nothing! Looking closer at the problem, we can see that in such cases we would only correctly estimate $x_p$ in the middle of the interval if our sample happened to have exactly the same number of members on either side of the interval. In many other cases, the result of the experiment would be one of the end points of the interval. This demonstrates that the results of our experiments do not necessarily get closer to $x_p$ as we increase $N$. Note that whatever strategy we had chosen for handling the intervals, we would have faced a similar problem in this case.
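Cases 2 and 3 can be illustrated with a small simulation. This is a sketch under stated assumptions: it looks only at the median, uses Python's `statistics.median` (which averages the two middle values for an even sample size) in place of the experiment's exact rule, and picks the sample size, repetition count, and seed arbitrarily. The function name `median_outcomes` is hypothetical.

```python
import random
import statistics
from collections import Counter

def median_outcomes(support, n, trials, seed=0):
    """Count how often each value occurs as the sample median of n uniform draws."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        sample = rng.choices(support, k=n)
        counts[statistics.median(sample)] += 1
    return counts

trials = 1000
counts_three = median_outcomes([1, 2, 3], n=1000, trials=trials)
counts_four = median_outcomes([1, 2, 3, 4], n=1000, trials=trials)

print("support {1,2,3}:  ", dict(counts_three))
print("support {1,2,3,4}:", dict(counts_four))
```

For the three-point support the sample median should essentially always be $2$. For the four-point support the outcomes stay split between the end points $2$ and $3$, with the midpoint $2.5$ appearing only when the sample is perfectly balanced.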

Regarding the nine methods mentioned in the article, I believe it's important to note that in the asymptotic case, i.e., a constant $p$ (say 0.01) and a very large $N$ (say millions), a floor/ceiling operator and/or a $+\frac{3}{8}$ doesn't really matter, as its effect pales in comparison to the magnitude of the $N \times p$ term. This means that all the methods boil down to the same thing if $N$ is large enough.
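To put a number on this, the sketch below evaluates the index $h$ of each method for a given $N$ and $p$ and reports how far apart the nine indices are. The $h$ formulas are transcribed as I recall them from the Wikipedia table, so treat them as an assumption; `H_FORMULAS` and `h_spread` are hypothetical names.

```python
from fractions import Fraction

# Index h for each of the nine methods (assumed transcription of the table).
H_FORMULAS = {
    "R-1": lambda n, p: n * p,
    "R-2": lambda n, p: n * p + Fraction(1, 2),
    "R-3": lambda n, p: n * p - Fraction(1, 2),
    "R-4": lambda n, p: n * p,
    "R-5": lambda n, p: n * p + Fraction(1, 2),
    "R-6": lambda n, p: (n + 1) * p,
    "R-7": lambda n, p: (n - 1) * p + 1,
    "R-8": lambda n, p: (n + Fraction(1, 3)) * p + Fraction(1, 3),
    "R-9": lambda n, p: (n + Fraction(1, 4)) * p + Fraction(3, 8),
}

def h_spread(n, p):
    """Largest minus smallest index h across the nine methods."""
    hs = [f(n, p) for f in H_FORMULAS.values()]
    return max(hs) - min(hs)

p = Fraction(1, 100)
for n in (100, 10_000, 1_000_000):
    spread = h_spread(n, p)
    print(f"N = {n}: spread of h = {float(spread)}, "
          f"relative to Np = {float(spread / (n * p))}")
```

Under these formulas the spread of $h$ does not grow with $N$, so its size relative to $Np$ shrinks as $N$ increases.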

Everything discussed in this answer pertains to the asymptotic case, where $N$ is very large. For a small number of samples from an arbitrary distribution, I believe a similar analysis would be more complicated and less applicable to "interesting" real-life use cases.

I hope this helps :)

ptohidi
  • 21
  • Note that the asymptotic Normality result holds only for sample quantiles from continuous distributions, see https://stats.stackexchange.com/a/86725/28746 – Alecos Papadopoulos Jan 24 '24 at 00:09
  • Thanks. An interesting and illuminating post. It doesn't answer my question, so I can't accept it, but I gave it an upvote. – Evan Aad Jan 24 '24 at 05:04
  • (+1) @EvanAad: It doesn't answer because it doesn't go into detail about the nine different flavours of sample quantiles or because approaching the population quantile as sample size increases wasn't what you meant by "consistent with"? – Scortchi - Reinstate Monica Jan 26 '24 at 10:00
  • @Scortchi-ReinstateMonica Both, I guess. I'll reiterate my question from my original post: "does each of the nine sample quantile formulas return a value from the corresponding population quantile interval?" – Evan Aad Jan 26 '24 at 11:03
  • @EvanAad: In short: it's random. If the population distribution has positive probability density at the quantile, the probability of returning the correct value (discrete case) or returning a close value (continuous case) increases when you increase N. – ptohidi Jan 27 '24 at 10:19
  • @EvanAad one other fact that I want to make sure you're considering is that each of the methods ultimately looks at two values (at most) in the sorted sample, namely the $\lfloor h \rfloor$-th sample and the $\lceil h \rceil$-th sample. The different methods just have slightly different formulas for calculating $h$ and interpolating between the two values. – ptohidi Jan 27 '24 at 10:32