
Are there well-known formulas for the order statistics of certain random distributions? In particular, the first and last order statistics of a normal random variable, but a more general answer would also be appreciated.

Edit: To clarify, I am looking for approximate formulas that can be more-or-less explicitly evaluated, not the exact integral expression.

For example, I have seen the following two approximations for the first order statistic (i.e. the minimum) of a normal random variable:

$e_{1:n} \geq \mu - \frac{n-1}{\sqrt{2n-1}}\sigma$

and

$e_{1:n} \approx \mu + \Phi^{-1} \left( \frac{1}{n+1} \right)\sigma$

The first of these, for $n=200$, gives approximately $e_{1:200} \geq \mu - 10\sigma$ which seems like a wildly loose bound.

The second gives $e_{1:200} \approx \mu - 2.58\sigma$ whereas a quick Monte Carlo gives $e_{1:200} \approx \mu - 2.75\sigma$, so it's not a bad approximation but not great either, and more importantly I don't have any intuition about where it comes from.
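For reference, a minimal sketch of that Monte Carlo check in R (the sample and replication counts are arbitrary):

set.seed(1)                              # for reproducibility
n    <- 200
reps <- 100000                           # arbitrary replication count
mins <- replicate(reps, min(rnorm(n)))   # minimum of n standard normals
mean(mins)                               # roughly -2.75, i.e. about mu - 2.75*sigma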

Any help?

Chris Taylor
  • If you use R, see the ppoints function. – cardinal Mar 31 '11 at 13:41
  • @probabilityislogic has given some good intuition for the approximations you list. Would it be helpful at all if I gave some more from an alternative viewpoint, or have you satisfied your curiosity on this matter? – cardinal Apr 01 '11 at 19:59

4 Answers


The classic reference is Royston (1982) [1], which gives algorithms going beyond explicit formulas. It also quotes a well-known formula by Blom (1958): $E(r:n) \approx \mu + \Phi^{-1}\left(\frac{r-\alpha}{n-2\alpha+1}\right)\sigma$ with $\alpha=0.375$. For $n=200$, $r=1$ this formula gives a multiplier of $-2.73$.

[1]: J. P. Royston (1982). Algorithm AS 177: Expected Normal Order Statistics (Exact and Approximate). Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 31, No. 2, pp. 161–165.
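A sketch of Blom's formula in R, via the ppoints function mentioned in the comments above (note that a = 3/8 must be passed explicitly, since ppoints defaults to a = 1/2 for n > 10):

# Blom (1958): E(r:n) ~ mu + qnorm((r - a)/(n - 2a + 1)) * sigma, with a = 3/8.
n <- 200
blom <- qnorm(ppoints(n, a = 3/8))   # ppoints(n, a) gives (1:n - a)/(n + 1 - 2a)
blom[1]                              # about -2.73 for the minimum (r = 1)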

Aniko
  • Thanks for the reference and the answer! I was wondering whether there's any such result on the (possibly normal) approximation of the CDF's/PDF's of these order statistics, and secondly a decent approximation for the ratio of the last to first order statistics. You can assume that the samples are from a normal distribution, just like the OP assumed. – Mathmath Aug 21 '20 at 10:43

$$\newcommand{\Pr}{\mathrm{Pr}}\newcommand{\Beta}{\mathrm{Beta}}\newcommand{\Var}{\mathrm{Var}}$$The distribution of the $i$th order statistic of any continuous random variable with a PDF is given by the "beta-F" compound distribution. The intuitive way to think about this distribution is to consider the $i$th order statistic in a sample of size $N$. For the value of the $i$th order statistic of a random variable $X$ to be equal to $x$, we need three things:

  1. $i-1$ values below $x$; each observation falls below $x$ with probability $F_{X}(x)$, where $F_X(x)=\Pr(X<x)$ is the CDF of the random variable $X$.
  2. $N-i$ values above $x$; each observation falls above $x$ with probability $1-F_{X}(x)$.
  3. 1 value inside an infinitesimal interval containing $x$; this has probability $f_{X}(x)dx$, where $f_{X}(x)dx=dF_{X}(x)=\Pr(x<X<x+dx)$ and $f_{X}(x)$ is the PDF of the random variable $X$.

There are ${N \choose 1}{N-1 \choose i-1}$ ways to make this choice, so we have:

$$f_{i}(x_{i})=\frac{N!}{(i-1)!(N-i)!}f_{X}(x_{i})\left[1-F_{X}(x_{i})\right]^{N-i}\left[F_{X}(x_{i})\right]^{i-1}$$
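Equivalently, integrating this density gives the CDF of the $i$th order statistic as a regularized incomplete beta function, $\Pr(X_{(i)}\leq x)=I_{F_X(x)}(i,N-i+1)$, which is easy to check by simulation. A sketch in R for a standard normal parent (the choices of $i$, $N$, and the test point are arbitrary):

# The CDF of the i-th order statistic is pbeta(F_X(x), i, N - i + 1).
set.seed(1)
N <- 20; i <- 5                    # arbitrary sample size and rank
sims <- replicate(20000, sort(rnorm(N))[i])
x <- -0.5                          # arbitrary test point
mean(sims <= x)                    # empirical Pr(X_(i) <= x)
pbeta(pnorm(x), i, N - i + 1)      # theoretical value; should agree closely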

EDIT: In my original post I made a very poor attempt at going further from this point, and the comments below reflect this. I have sought to rectify this below.

If we take the mean value of this pdf we get:

$$E(X_{i})=\int_{-\infty}^{\infty} x_{i}f_{i}(x_{i})dx_{i}$$

And in this integral we make the change of variable $p_{i}=F_{X}(x_{i})$ (taking @Henry's hint), so that the integral becomes:

$$E(X_{i})=\int_{0}^{1} F_{X}^{-1}(p_{i})\Beta(p_{i}|i,N-i+1)dp_{i}=E_{\Beta(p_{i}|i,N-i+1)}\left[F_{X}^{-1}(p_{i})\right]$$
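This expectation can be evaluated numerically to high accuracy; a sketch in R using a midpoint grid of beta quantiles, in the spirit of @Henry's comment below (the grid size is arbitrary):

# E[qnorm(p)] under Beta(i, N - i + 1), by averaging over beta quantiles
# at uniform midpoints; qnorm is F_X^{-1} for the standard normal.
N <- 200; i <- 1; K <- 100000
p <- qbeta(((1:K) - 0.5) / K, i, N - i + 1)
mean(qnorm(p))   # about -2.746 for i = 1, N = 200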

So this is the expected value of the inverse CDF, which can be well approximated using the delta method to give:

$$E_{\Beta(p_{i}|i,N-i+1)}\left[F_{X}^{-1}(p_{i})\right]\approx F_{X}^{-1}\left[E_{\Beta(p_{i}|i,N-i+1)}(p_{i})\right]=F_{X}^{-1}\left[\frac{i}{N+1}\right]$$

To make a better approximation, we can expand to second order (primes denoting differentiation), noting that the second derivative of an inverse function is:

$$\frac{\partial^{2}}{\partial a^{2}}F_{X}^{-1}(a)=-\frac{F_{X}^{''}(F_{X}^{-1}(a))}{\left[F_{X}^{'}(F_{X}^{-1}(a))\right]^{3}}=-\frac{f_{X}^{'}(F_{X}^{-1}(a))}{\left[f_{X}(F_{X}^{-1}(a))\right]^{3}}$$
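For example, for the standard normal this reduces to $\Phi^{-1}(a)/\phi\left(\Phi^{-1}(a)\right)^{2}$ (since $\phi'(x)=-x\,\phi(x)$), which can be checked against a finite difference; a sketch with an arbitrary evaluation point and step size:

# Check d^2/da^2 qnorm(a) = qnorm(a) / dnorm(qnorm(a))^2 numerically.
a <- 0.3; h <- 1e-4                                   # arbitrary point and step
(qnorm(a + h) - 2 * qnorm(a) + qnorm(a - h)) / h^2    # central finite difference
qnorm(a) / dnorm(qnorm(a))^2                          # closed form; should match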

Let $\nu_{i}=F_{X}^{-1}\left[\frac{i}{N+1}\right]$. Then we have:

$$E_{\Beta(p_{i}|i,N-i+1)}\left[F_{X}^{-1}(p_{i})\right]\approx \nu_{i}-\frac{\Var_{\Beta(p_{i}|i,N-i+1)}\left[p_{i}\right]}{2}\frac{f_{X}^{'}(\nu_{i})}{\left[f_{X}(\nu_{i})\right]^{3}}$$ $$=\nu_{i}-\frac{\left(\frac{i}{N+1}\right)\left(1-\frac{i}{N+1}\right)}{2(N+2)}\frac{f_{X}^{'}(\nu_{i})}{\left[f_{X}(\nu_{i})\right]^{3}}$$

Now, specialising to the normal case, we have $$f_{X}(x)=\frac{1}{\sigma}\phi\left(\frac{x-\mu}{\sigma}\right)\implies f_{X}^{'}(x)=-\frac{x-\mu}{\sigma^{3}}\phi\left(\frac{x-\mu}{\sigma}\right)=-\frac{x-\mu}{\sigma^{2}}f_{X}(x)$$ $$F_{X}(x)=\Phi\left(\frac{x-\mu}{\sigma}\right)\implies F_{X}^{-1}(x)=\mu+\sigma\Phi^{-1}(x)$$

Note that $f_{X}(\nu_{i})=\frac{1}{\sigma}\phi\left[\Phi^{-1}\left(\frac{i}{N+1}\right)\right]$, and the expectation approximately becomes:

$$E[x_{i}]\approx \mu+\sigma\Phi^{-1}\left(\frac{i}{N+1}\right)+\frac{\left(\frac{i}{N+1}\right)\left(1-\frac{i}{N+1}\right)}{2(N+2)}\frac{\sigma\Phi^{-1}\left(\frac{i}{N+1}\right)}{\left[\phi\left[\Phi^{-1}\left(\frac{i}{N+1}\right)\right]\right]^{2}}$$

And finally:

$$E[x_{i}]\approx \mu+\sigma\Phi^{-1}\left(\frac{i}{N+1}\right)\left[1+\frac{\left(\frac{i}{N+1}\right)\left(1-\frac{i}{N+1}\right)}{2(N+2)\left[\phi\left[\Phi^{-1}\left(\frac{i}{N+1}\right)\right]\right]^{2}}\right]$$
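A sketch of this final formula in R for the standard normal case ($\mu=0$, $\sigma=1$), compared with the first-order approximation alone:

# Second-order (delta method) approximation to E[x_i], standard normal.
approx2 <- function(i, N) {
  p <- i / (N + 1)
  z <- qnorm(p)
  z * (1 + p * (1 - p) / (2 * (N + 2) * dnorm(z)^2))
}
qnorm(1 / 201)    # first order:  about -2.578
approx2(1, 200)   # second order: about -2.73 (the exact value is -2.746)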

Although, as @whuber has noted, this will not be accurate in the tails. In fact it may be even worse there, because of the skewness of the beta distribution when its two parameters are very different.

  • "Maximum likelihood estimator of a random variable"? Not sure what that is, but I think you've (almost) calculated the *mode*. – cardinal Mar 31 '11 at 14:39
  • Something mysterious happens about two-thirds of the way through when suddenly $\mu$ and $\sigma$ appear without warning or definition. – whuber Mar 31 '11 at 15:17
  • I don't mean to "pile on", but it's also hard for me to see how the quantity in brackets can be approximated by a negative number. – cardinal Mar 31 '11 at 15:34
  • mean(qnorm(qbeta(((1:100000) - 0.5 ) / 100000, 1, 200))) in R will give an approximate numerical integration, and in this case gives -2.746042. Just change the 1, 200 to get other quantiles. – Henry Mar 31 '11 at 21:58
  • @cardinal - my mistake, this should be the inverse CDF $-\Phi^{-1}(\frac{1}{N+1})$ as defined in the question. And I have calculated the mode of the PDF, which is (but for an ideological restriction) a maximum likelihood estimator - all I have to do is replace the "random variable" with a "Greek letter" and then magically - it's an MLE! But clearly my answer is in need of repair – probabilityislogic Apr 01 '11 at 09:48
  • @whuber - they do not appear out of nowhere, for they are defined in the question. But I will update my answer to "introduce" them better. – probabilityislogic Apr 01 '11 at 09:49
  • @probabilityislogic, while at the level of calculus, you might say that in this case we're considering a bivariate function and simply maximizing over one variable instead of another, I think there are reasons mathematical, statistical, and pedagogical not to call what you've done "maximum likelihood estimation". They are too numerous to enumerate in this space, but a simple one that I think is compelling enough is that we use a particular, arcane vocabulary in statistics for a reason. Changing that on a whim for a single problem can lead to misunderstanding(s).../... – cardinal Apr 01 '11 at 11:52
  • .../...Besides, we already have a specific, arcane term in our vocabulary to describe what you are doing. So, "renaming" it to use another specific, arcane term that does not (normally) describe what you are doing seems unwarranted (at least to me). – cardinal Apr 01 '11 at 11:54
  • @cardinal - point taken, I have removed this from my answer, and acknowledged its inadequacy. – probabilityislogic Apr 01 '11 at 12:59
  • @probability +1 for a nice exposition of working with order statistics. However, this approximation cannot be expected to work well in the tails (extreme values of i), because the inverse CDF has asymptotes at 0 and 1 and its Taylor series approximation is poor there. E.g., with N=200 and i=1 and the standard normal distribution, the approximation gives -2.57755 compared to -2.74604--a fairly big difference--whereas for n=100, the approximation is -0.00376 compared to -0.00626, a small difference. A better approach is to substitute $p_i=F_X(u)$ in the integral and use a saddlepoint approx. – whuber Apr 01 '11 at 16:27
  • @probabilityislogic (+1) for the revised answer. One suggestion: maybe $\Rightarrow$ is better than $\to$ to mean "implies". It took staring at a couple lines for a few seconds to realize you weren't making some convergence claim. – cardinal Apr 01 '11 at 20:25
  • What are the asymptotics of $E[x_{i}]$ as $N\rightarrow\infty$, at least for $i=1$ or $i=N$? – Hans Mar 22 '18 at 19:38
  • This thread and the post has been of help to me! I was wondering whether there's any such result on the (possibly normal) approximation of the CDF's/PDF's of these order statistics, and secondly a decent approximation for the ratio of the last to first order statistics. You can assume that the samples are from a normal distribution, just like the OP assumed. – Mathmath Aug 21 '20 at 10:45

Aniko's answer relies on Blom's well-known formula, which involves a choice of $\alpha = 3/8$. It turns out that this formula is itself a mere approximation of an exact answer due to G. Elfving (1947), The asymptotical distribution of range in samples from a normal population, Biometrika, Vol. 34, pp. 111–119. Elfving's formula is aimed at the minimum and maximum of the sample, for which the correct choice of $\alpha$ is $\pi/8$. Blom's formula results when we approximate $\pi$ by $3$.

By using the Elfving formula rather than Blom's approximation, we get a multiplier of -2.744165. This number is closer to Erik P.'s exact answer (-2.746) and to the Monte Carlo approximation (-2.75) than is Blom's approximation (-2.73), while being easier to implement than the exact formula.
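A sketch of the comparison in R, plugging each $\alpha$ into the same formula for the minimum ($r=1$, $n=200$):

# Expected-minimum multiplier qnorm((1 - a)/(n - 2a + 1)) for two alphas.
mult <- function(a, n = 200) qnorm((1 - a) / (n - 2 * a + 1))
mult(3/8)    # Blom:    about -2.734
mult(pi/8)   # Elfving: about -2.744, closer to the exact -2.746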

whuber
  • Could you provide a bit more detail as to how $\alpha=\pi/8$ is arrived at through Elfving (1947)? It's not obvious in the article. – Anthony May 18 '15 at 14:12
  • Anthony - I am relying on the textbook Mathematical Statistics by Samuel Wilks, Wiley (1962). Exercise 8.21 on p. 249 states: "If $x_{(1)}, x_{(n)}$ are the smallest and largest order statistics of a sample of size $n$ from a continuous c.d.f. $F(x)$... the random variable $2n\sqrt{F(x_{(1)})\left[1-F(x_{(n)})\right]}$ has a limit distribution as $n \to \infty$, with mean $\pi/2$ and variance $4-\pi^{2}/4$." For a symmetric distribution, $F(x_{(1)}) = 1-F(x_{(n)})$. Thus $F(x_{(1)})$ is about $\pi/(4n)$, i.e. $x_{(1)} \approx F^{-1}(\pi/(4n))$. The Blom formula uses the approximation $3/(4n)$. – Hal M. Switkay May 18 '15 at 15:45
  • This reminds me of the infamous "$\pi = 3$" bill attributed to the Indiana State Legislature. (Though the Wikipedia article suggests that the popular version of the story is not accurate.) – steveo'america Oct 02 '19 at 22:27

Depending on what you want to do, this answer may or may not help - I got the following exact formula from Maple's Statistics package.

with(Statistics):
X := OrderStatistic(Normal(0, 1), 1, n):
m := Mean(X):
m;

$$\int_{-\infty}^{\infty} \frac{n!}{(n-1)!}\, t\, \frac{e^{-t^{2}/2}}{\sqrt{2\pi}} \left(\frac{1}{2}-\frac{1}{2}\operatorname{erf}\!\left(\frac{t}{\sqrt{2}}\right)\right)^{n-1} dt$$

By itself this isn't very useful (and it could probably be derived fairly easily by hand, since it's the minimum of $n$ random variables), but it does allow for quick and very accurate approximation for given values of $n$ - much more accurate than Monte Carlo:

evalf(eval(m, n = 200));
evalf[25](eval(m, n = 200));

gives -2.746042447 and -2.746042447451154492412344, respectively.
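As a cross-check outside Maple, the same expectation can be computed by adaptive quadrature in R, writing the integrand as $n\,t\,\phi(t)\left[1-\Phi(t)\right]^{n-1}$ (a sketch):

# E[min of n standard normals] by numerical integration.
n <- 200
f <- function(t) n * t * dnorm(t) * pnorm(t, lower.tail = FALSE)^(n - 1)
integrate(f, -Inf, Inf)   # about -2.746042, matching the Maple result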

(Full disclosure - I maintain this package.)

Erik P.
  • @ProbabilityIsLogic derived this integral for all order statistics in the first half of his reply. – whuber Mar 31 '11 at 19:42