
I'm a mathematician self-studying statistics and struggling especially with the language.

In the book I'm using, there is the following problem:

A random variable $X$ is given as $\text{Pareto}(\alpha,60)$-distributed with $\alpha>0$. (Of course, you could take any distribution depending on one parameter for the sake of this question.) Then a sample of five values $14$, $21$, $6$, $32$, $2$ is given.

First part: "Using the method of maximum likelihood, find an estimate $\hat{\alpha}$ of $\alpha$ based on [the sample]." This was no problem. The answer is $\hat{\alpha}\approx 4.6931$.

But then: "Give an estimate for the standard error of $\hat{\alpha}$."

What is meant by this? Since $\hat{\alpha}$ is just a fixed real number, I don't see in what way it could have a standard error. Am I to determine the standard deviation of $\text{Pareto}(\hat{\alpha},60)$?

If you think the question is not clear, this information would help me as well.

Stefan

2 Answers


As a maximum likelihood estimator, $\hat{\alpha}$ is a function of a random sample, and so is itself random (not fixed). An estimate of the standard error of $\hat{\alpha}$ can be obtained from the Fisher information,

$$ I(\theta) = -\mathbb{E}\left[ \frac{\partial^2 \mathcal{L}(\theta \mid Y)}{\partial \theta^2} \right] $$

where $\theta$ is a parameter and $\mathcal{L}(\theta \mid Y = y)$ is the log-likelihood of $\theta$ given the random sample $y$. Intuitively, the Fisher information measures the curvature of the log-likelihood surface around the MLE, and hence the amount of 'information' that $y$ provides about $\theta$.

For a $\mathrm{Pareto}(\alpha,y_0)$ distribution with a single realization $Y = y$ and $y_0$ known, the log-likelihood and its derivatives in $\alpha$ are:

$$ \begin{aligned} \mathcal{L}(\alpha|y,y_0) &= \log \alpha + \alpha \log y_0 - (\alpha + 1) \log y \\ \mathcal{L}'(\alpha|y,y_0) &= \frac{1}{\alpha} + \log y_0 - \log y \\ \mathcal{L}''(\alpha|y,y_0) &= -\frac{1}{\alpha^2} \end{aligned} $$

Plugging into the definition of Fisher information,

$$ I(\alpha) = \frac{1}{\alpha^2} $$

For a sample $\{y_1, y_2, \ldots, y_n\}$ of size $n$, the maximum likelihood estimator $\hat{\alpha}$ is approximately normal in large samples:

$$ \hat{\alpha} \,\dot\sim\, \mathcal{N}\!\left(\alpha,\ \frac{1}{nI(\alpha)}\right) = \mathcal{N}\!\left(\alpha,\ \frac{\alpha^2}{n}\right) $$

Because $\alpha$ is unknown, we can plug in $\hat{\alpha}$ to obtain an estimate of the standard error:

$$ \mathrm{SE}(\hat{\alpha}) \approx \sqrt{\hat{\alpha}^2/n} \approx \sqrt{4.6931^2/5} \approx 2.1 $$
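As a quick numerical check of these numbers, here is a minimal sketch in Python. It assumes the shifted-Pareto parametrization $f(y) = \alpha y_0^{\alpha}/(y_0+y)^{\alpha+1}$ for $y>0$, which is the one that reproduces the questioner's $\hat\alpha \approx 4.6931$; note that $\mathcal{L}''(\alpha) = -1/\alpha^2$, and hence $I(\alpha) = 1/\alpha^2$, is unchanged under that parametrization.

```python
import numpy as np

# Sample and (assumed) known scale parameter from the question.
y = np.array([14.0, 21.0, 6.0, 32.0, 2.0])
y0 = 60.0
n = len(y)

# MLE under f(y) = alpha * y0^alpha / (y0 + y)^(alpha + 1):
# alpha_hat = n / sum(log(1 + y_i / y0))
alpha_hat = n / np.sum(np.log1p(y / y0))

# Plug-in standard error from the Fisher information: sqrt(alpha_hat^2 / n)
se_hat = alpha_hat / np.sqrt(n)

print(alpha_hat)  # ~4.693
print(se_hat)     # ~2.099, i.e. the ~2.1 above
```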

Nate Pope
  • For your second to last line, $\hat{\alpha} \overset{n \rightarrow \infty}{\sim} \mathcal{N}(\alpha,\frac{1}{nI(\alpha)})$, it doesn't appear the notation is correct. If $n \to \infty$, then $n$ can't appear on the right side. Instead, you want $\hat{\alpha} \,\dot{\approx}\, \mathcal{N}(\alpha,\frac{1}{nI(\alpha)})$ – user321627 Feb 21 '18 at 04:54

The other answer has covered the derivation of the standard error; I just want to help you with the notation:

Your confusion is due to the fact that in Statistics we use exactly the same symbol to denote the Estimator (which is a function) and a specific estimate (which is the value that the estimator takes when it receives a specific realized sample as input).

So $\hat \alpha = h(\mathbf X)$, and $\hat \alpha(\mathbf X = \mathbf x) = 4.6931$ for $\mathbf x = \{14,\,21,\,6,\,32,\,2\}$. Hence $\hat \alpha(\mathbf X)$ is a function of random variables, and so a random variable itself, one that certainly has a variance.
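A minimal sketch of this distinction in code (Python, again assuming the shifted-Pareto parametrization that reproduces $4.6931$; the function name `alpha_hat` is just illustrative):

```python
import numpy as np

def alpha_hat(x, y0=60.0):
    """The Estimator h: a function that maps a sample to a number."""
    x = np.asarray(x, dtype=float)
    return len(x) / np.sum(np.log1p(x / y0))

# The estimate: the value the estimator takes at the realized sample.
print(alpha_hat([14, 21, 6, 32, 2]))  # ~4.693

# A different realization of the sample gives a different value,
# which is why the estimator, viewed as h(X), is itself a random variable.
rng = np.random.default_rng(0)
new_sample = 60.0 * rng.pareto(4.6931, size=5)  # numpy's pareto() is the shifted (Lomax) form
print(alpha_hat(new_sample))
```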

In ML estimation, what we can compute is in many cases only the asymptotic standard error, because the finite-sample distribution of the estimator is not known (i.e. it cannot be derived).

Strictly speaking, $\hat \alpha$ does not have an asymptotic distribution, since it converges to a real number (the true value of the parameter, in almost all cases of ML estimation). But the quantity $\sqrt n (\hat \alpha - \alpha)$ converges in distribution to a normal random variable (by application of the Central Limit Theorem).

A second point of notational confusion: most, if not all, texts will write $\text {Avar}(\hat \alpha)$ ("Avar" = asymptotic variance) while what they mean is $\text {Avar}(\sqrt n (\hat \alpha - \alpha))$, i.e. they refer to the asymptotic variance of the quantity $\sqrt n (\hat \alpha - \alpha)$, not of $\hat \alpha$ itself. For the case of a basic Pareto distribution we have

$$\text {Avar}[\sqrt n (\hat \alpha - \alpha)] = \alpha^2$$

and so $$\text {Avar}(\hat \alpha ) = \alpha^2/n$$

(but what you will find written is $\text {Avar}(\hat \alpha ) = \alpha^2$)

Now, in what sense does the Estimator $\hat \alpha$ have an "asymptotic variance", given that, as said, it converges asymptotically to a constant? In an approximate sense, and for large but finite samples. That is: somewhere in between a "small" sample, where the Estimator is a random variable with (usually) unknown distribution, and an "infinite" sample, where the Estimator is a constant, lies the "large but finite sample" territory, where the Estimator has not yet become a constant and where its distribution and variance are derived in a roundabout way. First one uses the Central Limit Theorem to derive the properly asymptotic distribution of the quantity $Z = \sqrt n (\hat \alpha - \alpha)$, which is normal. Then one turns things around and writes $\hat \alpha = \frac 1{\sqrt n} Z + \alpha$ (taking one step back and treating $n$ as finite), which exhibits $\hat \alpha$ as an affine function of the normal random variable $Z$, and so as normally distributed itself (always approximately).
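To illustrate the "large but finite sample" point, here is a rough Monte Carlo sketch (Python, under the same shifted-Pareto assumption as before): for a largish $n$, the empirical standard deviation of $\hat \alpha$ across repeated samples should be close to the asymptotic value $\alpha/\sqrt n$, and a histogram of the $\hat \alpha$'s would look approximately normal.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha, y0 = 4.6931, 60.0   # "true" values, chosen here for illustration
n, reps = 200, 10_000      # large-but-finite sample size, many replications

# numpy's pareto() draws from the shifted (Lomax) form with scale 1,
# so y0 * pareto(alpha) has density alpha * y0^alpha / (y0 + y)^(alpha + 1).
samples = y0 * rng.pareto(alpha, size=(reps, n))
alpha_hats = n / np.sum(np.log1p(samples / y0), axis=1)  # MLE per replication

print(alpha_hats.std())     # empirical SE of alpha_hat across samples
print(alpha / np.sqrt(n))   # asymptotic SE alpha/sqrt(n), ~0.332
```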

  • +1 for distinguishing between $\hat{\alpha}$ and $\sqrt{n}(\hat{\alpha} - \alpha)$ -- certainly the notation can be inconsistent. – Nate Pope Mar 02 '14 at 21:27