Suppose that I have a list of numbers, say [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. These numbers are realisations of a random variable $X$ whose distribution I am interested in.

Suppose I want to estimate the 5th percentile of this distribution (to form a confidence interval for $X$). Intuitively, I would take the lowest number (i.e. 1) as my estimate, since 0% of the numbers lie below it and 10% lie at or below it (which in some sense 'averages out' to the desired 5%!). However, when I use numpy.percentile (with the default settings), it returns an estimate of 1.45. Which estimate is better, and why?

Update: To clarify, my goal is to estimate the interval $I=[x,y]$ with $y > x$ such that $\mathbb{P}(X \in I)=0.95$. The numbers $x$ and $y$ are the 5th and 95th percentiles respectively of the distribution of $X$.
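A rough sketch of the procedure described in the update, assuming purely for illustration that $X$ is standard normal and that 10,000 draws are available (the distribution, seed, and variable names are my own choices, not part of the original problem). It estimates the endpoints with numpy.percentile and then checks how often fresh draws land inside the estimated interval:

```python
import numpy as np

# Assumption: X is standard normal. This is only for illustration;
# the real distribution of X is unknown.
rng = np.random.default_rng(0)
sample = rng.standard_normal(10_000)

# Estimate the endpoints from the sample (default "linear" method).
lower, upper = np.percentile(sample, [5, 95])

# Check the empirical coverage of [lower, upper] on fresh draws.
fresh = rng.standard_normal(100_000)
coverage = np.mean((fresh >= lower) & (fresh <= upper))

print(f"estimated interval: [{lower:.3f}, {upper:.3f}]")
print(f"empirical coverage: {coverage:.3f}")
```

Note that the interval between the 5th and 95th percentiles contains about 90% of the probability mass; for 95% coverage with equal tails, the 2.5th and 97.5th percentiles would be the natural choice.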

  • Could you please explain how an estimate of the 5th percentile would form part of a confidence interval for any property of $X$? What property specifically? Only then could we possibly justify any opinions about what procedure, if any, might be better than another one. – whuber Apr 18 '23 at 15:29
  • @whuber I am interested in the distribution of $X$. By 95% confidence interval, I mean an interval $I = [x, y]$ with $y > x$ such that $\mathbb{P}(X \in I) = 0.95$. My plan is to use the estimates of the percentiles as estimates for the endpoints $x$, $y$. – afreelunch Apr 19 '23 at 11:11
  • You might see the documentation for numpy.percentile in Python. (I assume that's what you're using.) There are several options for methods that it uses to calculate percentiles. numpy.org/doc/stable/reference/generated/numpy.percentile.html – Sal Mangiafico Apr 19 '23 at 11:18
  • Your "95% confidence interval" is called a 100% tolerance interval. Unless you make some assumptions that are tantamount to knowing this interval in the first place, the only such interval that is valid extends from $-\infty$ to $+\infty,$ because no finite amount of data can tell you what the bounds of $X$ might be. The CLT tells you nothing about the percentiles of the underlying distribution. – whuber Apr 19 '23 at 13:24
  • @whuber I'm a bit confused: I am not interested in estimating the bounds of $X$ (i.e. the support of $X$). Do you mean the bounds $x$, $y$ (defined in my earlier comment)? – afreelunch Apr 19 '23 at 14:26
  • Perhaps I should add a simple example to clarify, in case matters remain unclear. Suppose I sample $X$ 100 times. To form the region $I$ defined above (which I called a 'confidence interval', though maybe that is incorrect...), I would do something like calculate the 95th highest draw and 5th highest draw. This seems like a sensible procedure since, empirically, $X$ lies in this interval about 95% of the time. My question was how exactly to go about doing this (should it be the 95th, or some average of the 95th and 96th, etc.) – afreelunch Apr 19 '23 at 14:39
  • Okay, I think I see what you're looking for. A tolerance interval with 95% coverage and $100(1-\alpha)\%$ confidence is an interval $[x,y],$ with $x$ and $y$ depending on the data, for which both $\Pr(X\in I)=95\%$ and, because this is a random interval, this probability statement holds with $100(1-\alpha)\%$ confidence. In other words, (i) you seek a tolerance interval and (ii) you need to be aware of the fact it's random. Order statistics are often used to obtain nonparametric tolerance intervals. – whuber Apr 19 '23 at 15:19
  • If my interpretation is correct, you haven't a prayer of obtaining a nonparametric 95% tolerance interval using a sample of size 10, for what I hope are intuitively obvious reasons. But if you make parametric assumptions about the distribution of $X,$ you can obtain such tolerance intervals. They tend to be very uncertain and depend heavily on the correctness of those parametric assumptions. So, if you really are looking for a tolerance interval of some sort, please edit your post to make that clear, because you definitely are not just computing a percentile of the data! – whuber Apr 19 '23 at 15:20
  • @whuber Agreed: for this reason, I have actually drawn 10,000 samples. The 10 sample case is just a hypothetical that I used when stating the question. Note that the question is still well posed in this case: for any $n$, one may ask how best to estimate percentiles (although estimation is obviously very hard when $n$ is small, and more dramatic bias correction techniques may be needed). – afreelunch Apr 20 '23 at 14:23
  • @whuber Also, I definitely am trying to estimate percentiles! (The 5th and 95th percentiles.) Meanwhile, I don't want the statement $\mathbb{P}(X \in I) = 0.95$ to hold only with a certain probability: it should hold with probability $1$. The reason is that $I$ is an interval defined by constants, not random variables. – afreelunch Apr 20 '23 at 14:27
  • Your question is still confusing by referring to a "confidence interval for $X.$" With the standard meaning of a confidence interval, that makes no sense. And it's still unclear whether you want a nonparametric tolerance interval or whether you can make distributional assumptions. And, finally--as with most statistical procedures--your estimate of the tolerance interval will be uncertain. You need to specify what kind of uncertainty you are concerned about. For instance, should your estimated tolerance interval have a high probability of containing the true interval? – whuber Apr 20 '23 at 14:30
  • @whuber I agree that given a finite number of draws from $X$, my estimates of $x$ and $y$ will be uncertain. However, one can ask what is the 'best estimate' (e.g. unbiased, consistent, etc.) This is really my question. – afreelunch Apr 20 '23 at 14:33
  • That's problematic when you are estimating an interval. Moreover, most such criteria make no sense in a nonparametric setting or just cannot be applied. That's why disclosing additional details about your specific problem can make the difference between a question that is too broad and one that has a clear, good answer. Right now, especially after your last comment, it's not even evident what your question is. – whuber Apr 20 '23 at 14:35
  • I agree with the comments by @whuber ... From some of the comments, it sounds like you want a bootstrap estimate for the 5th percentile and 95th percentile from your sample. You can do this. You might consider if this addresses what you want better than simply calculating the 5th and 95th percentile from your sample. ... How to calculate percentiles on small samples is a separate question. As I mentioned, see the documentation in the functions in Python or R to consider the different methods available. – Sal Mangiafico Apr 20 '23 at 14:48
  • @Sal I believe the bootstrap estimate will be identical to the usual nonparametric estimate based on order statistics, depending on which form of tolerance interval is needed. A TI can also be conceived of as a pair of confidence limits for the endpoint percentiles and consequently similar specification issues apply, such as whether you want upper, lower, or two-sided limits, whether you want them to be symmetric in probability, and so on. – whuber Apr 20 '23 at 14:52

1 Answer

As pointed out by @Sal Mangiafico, some helpful documentation on numpy.percentile is available at numpy.org/doc/stable/reference/generated/numpy.percentile.html, which in turn draws on the useful paper:

R. J. Hyndman and Y. Fan, “Sample quantiles in statistical packages,” The American Statistician, 50(4), pp. 361-365, 1996

As explained in the paper, various approaches to the problem are possible. To illustrate a few in the context of this example:

  1. The inverted CDF method produces an estimate of $1$, which matches up with how I would have naively approached the problem.
  2. The linear method (the default) views our 10 draws as dividing the probability range from $0$ to $1$ into $9$ intervals, each with width $1/9 = 0.\dot{1}$. Since we want the 5th percentile (0.05), we want the first interval, whose endpoints are the draws $1$ and $2$. How far along the interval should we be? Using linear interpolation, it's $1 + 0.05/0.\dot{1} = 1.45$.
  3. Hyndman and Fan (1996) recommend the median unbiased method when the underlying distribution is unknown. In this example, this method also yields an estimate of $1$.
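The three estimates above can be reproduced with a short sketch. It assumes NumPy 1.22 or later, where numpy.percentile accepts the `method` keyword; the variable names are my own:

```python
import numpy as np

data = np.arange(1, 11)  # the list [1, 2, ..., 10] from the question

# Method names as accepted by numpy.percentile (NumPy >= 1.22).
methods = ("inverted_cdf", "linear", "median_unbiased")

# Estimate the 5th percentile with each method.
estimates = {m: float(np.percentile(data, 5, method=m)) for m in methods}

for name, value in estimates.items():
    print(f"{name:>16}: {value:.2f}")
```

Only the linear method interpolates above the smallest draw here; the other two return the smallest draw itself.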

I believe that all methods yield identical estimates in the limit as the sample size goes to infinity.
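As an informal check of this (not a proof), the sketch below compares the three methods on samples of increasing size. It assumes, for illustration only, that the draws are standard normal, and it again relies on the `method` keyword from NumPy 1.22 or later:

```python
import numpy as np

# Assumption: standard normal draws, chosen only for illustration.
rng = np.random.default_rng(0)
methods = ("inverted_cdf", "linear", "median_unbiased")

spreads = {}
for n in (10, 1_000, 100_000):
    sample = rng.standard_normal(n)
    estimates = [float(np.percentile(sample, 5, method=m)) for m in methods]
    # Largest disagreement between any two methods at this sample size.
    spreads[n] = max(estimates) - min(estimates)
    print(f"n = {n:>7}: spread between methods = {spreads[n]:.5f}")
```

The methods differ only in how they interpolate between neighbouring order statistics, so their disagreement is limited by the gap between adjacent draws near the 5th percentile, which shrinks as the sample grows.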

  • This answer makes it look like you are trying to re-ask the question at https://stats.stackexchange.com/questions/178578. Would that be the case? If so, that thread has a good answer. You can find out more by searching our site for quantile Hyndman Fan. – whuber Apr 20 '23 at 15:01
  • Thanks for the link, which is useful. The questions are somewhat different since that poster was not explicitly trying to estimate the interval $I$. However, the answer is still helpful (although not as helpful as reading the Hyndman Fan paper itself!) – afreelunch Apr 20 '23 at 15:08
  • But your answer doesn't really cover estimation in any meaningful way: it--as well as the title of your question--focuses on computing percentiles of "lists of numbers." – whuber Apr 20 '23 at 15:11
  • I didn't go into the properties of the various estimators since they are already in the paper. However, I agree that a non-technical write up of these properties would be nice. – afreelunch Apr 20 '23 at 15:13