Some intuition behind the delta method:
The Delta method can be seen as combining two ideas:
- Continuous, differentiable functions can be approximated locally by an affine transformation.
- An affine transformation of a multivariate normal random variable is multivariate normal.
The first idea is from calculus, the second is from probability. The loose intuition / argument goes:
- The input random variable $\tilde{\boldsymbol{\theta}}_n$ is asymptotically normal (by assumption, or by a central limit theorem in the case where $\tilde{\boldsymbol{\theta}}_n$ is a sample mean).
- In a small enough neighborhood, $\mathbf{g}(\mathbf{x})$ looks like an affine transformation: the smaller the neighborhood, the more the function looks like a hyperplane (or a line in the one-variable case).
- Where that affine approximation applies (and some regularity conditions hold), the multivariate normality of $\tilde{\boldsymbol{\theta}}_n$ is preserved when the function $\mathbf{g}$ is applied to $\tilde{\boldsymbol{\theta}}_n$.
- Note that function $\mathbf{g}$ has to satisfy certain conditions for this to be true. Normality isn't preserved in the neighborhood around $x=0$ for $g(x) = x^2$ because you'll basically get both halves of the bell curve mapped to the same side: both $x=-2$ and $x=2$ get mapped to $y=4$. You need $g$ strictly increasing or decreasing in the neighborhood so that this doesn't happen.
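A quick simulation (a sketch of mine using NumPy, not from the original text) illustrates the folding: for $X \sim \mathcal{N}(0, 1)$, the transformed variable $X^2$ is supported only on $[0, \infty)$ and is heavily right-skewed, so it cannot be normal.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)
y = x**2  # both x = -2 and x = 2 map to y = 4

# y cannot be normal: the left half of the bell curve is folded onto the right
print(y.min() >= 0)           # True: no mass below zero
print(np.mean(y), np.var(y))  # near 1 and 2: in fact y ~ chi-squared(1)
```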
Idea 1: Locally, any continuous, differentiable function looks affine
A basic idea from calculus is that if you zoom in far enough on a continuous, differentiable function, it looks like a line (or a hyperplane in the multivariate case). If we have some vector-valued function $\mathbf{g}(\mathbf{x})$, then in a small enough neighborhood around $\mathbf{c}$ we can approximate $\mathbf{g}(\mathbf{c} + \boldsymbol{\epsilon})$ by the following affine function of $\boldsymbol{\epsilon}$:
$$ \mathbf{g}(\mathbf{c} + \boldsymbol{\epsilon}) \approx \mathbf{g}(\mathbf{c}) + \frac{\partial \mathbf{g}(\mathbf{c})}{\partial \mathbf{x}'} \;\boldsymbol{\epsilon} $$
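We can check this numerically. Below is a sketch with a hypothetical function $\mathbf{g}(x_1, x_2) = (x_1 x_2, e^{x_1})$ of my own choosing: the affine approximation built from the Jacobian tracks the exact value to within an error of order $\|\boldsymbol{\epsilon}\|^2$.

```python
import numpy as np

# A hypothetical vector-valued function g and its Jacobian (illustrative, not from the text)
def g(x):
    return np.array([x[0] * x[1], np.exp(x[0])])

def jacobian(x):
    # Entry (i, j) is d g_i / d x_j
    return np.array([[x[1], x[0]],
                     [np.exp(x[0]), 0.0]])

c = np.array([1.0, 2.0])
eps = np.array([1e-3, -2e-3])

exact = g(c + eps)
affine = g(c) + jacobian(c) @ eps

# The approximation error shrinks like ||eps||^2
print(np.max(np.abs(exact - affine)))
```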
Idea 2: An affine transformation of a multivariate normal random variable is multivariate normal
Let's say we have $\tilde{\boldsymbol{\theta}}$ distributed multivariate normal with mean $\boldsymbol{\mu}$ and covariance matrix $V$. That is:
$$\tilde{\boldsymbol{\theta}} \sim \mathcal{N}\left( \boldsymbol{\mu}, V\right)$$
Now take a linear transformation $A$ and form the random variable $A\tilde{\boldsymbol{\theta}}$, which is again multivariate normal. It's easy to show:
$$A\tilde{\boldsymbol{\theta}} - A\boldsymbol{\mu} \sim \mathcal{N}\left(\mathbf{0}, AVA'\right)$$
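A simulation sketch (with $\boldsymbol{\mu}$, $V$, and $A$ chosen for illustration, not taken from the text) confirms that the empirical covariance of $A\tilde{\boldsymbol{\theta}} - A\boldsymbol{\mu}$ matches $AVA'$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative mean, covariance, and transformation
mu = np.array([1.0, -1.0])
V = np.array([[2.0, 0.5],
              [0.5, 1.0]])
A = np.array([[1.0, 1.0],
              [2.0, -1.0]])

theta = rng.multivariate_normal(mu, V, size=200_000)
z = theta @ A.T - A @ mu  # A*theta - A*mu, one row per draw

# Empirical covariance of z should be close to A V A'
print(np.cov(z.T))
print(A @ V @ A.T)
```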
Putting it together:
If we know that $\tilde{\boldsymbol{\theta}} \sim \mathcal{N}\left( \boldsymbol{\mu}, V\right)$ and that the function $\mathbf{g}(\mathbf{x})$ can be approximated around $\boldsymbol{\mu}$ by $\mathbf{g}(\boldsymbol{\mu}) + \frac{\partial \mathbf{g}(\boldsymbol{\mu})}{\partial \mathbf{x}'} \;\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} = \tilde{\boldsymbol{\theta}} - \boldsymbol{\mu}$, then putting ideas (1) and (2) together gives:
$$ \mathbf{g}\left( \tilde{\boldsymbol{\theta}} \right) - \mathbf{g}(\boldsymbol{\mu}) \sim \mathcal{N} \left( \mathbf{0}, \frac{\partial \mathbf{g}(\boldsymbol{\mu})}{\partial \mathbf{x}'} V \frac{\partial \mathbf{g}(\boldsymbol{\mu})}{\partial \mathbf{x}'} '\right) $$
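Here's a one-dimensional sanity check of this combined result (my own sketch: $g(x) = e^x$ with a small variance so the affine approximation is good). The delta-method variance $g'(\mu)^2 V$ should be close to the simulated variance of $g(\tilde{\theta})$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 1-D example: g(x) = exp(x), so g'(mu) = exp(mu)
mu, sigma = 0.5, 0.05  # small sigma so g is nearly affine where theta has mass
theta = rng.normal(mu, sigma, size=200_000)
g_theta = np.exp(theta)

delta_var = np.exp(mu) ** 2 * sigma**2  # g'(mu)^2 * V
print(np.var(g_theta), delta_var)       # the two should nearly agree
```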
What can go wrong?
We have a problem doing this if the Jacobian $\frac{\partial \mathbf{g}(\mathbf{c})}{\partial \mathbf{x}'}$ is zero (or, in the multivariate case, not of full rank), e.g. $g(x) = x^2$ at $x=0$: the limiting normal distribution becomes degenerate. In the scalar case, we need $g$ strictly increasing or decreasing in the region where $\tilde{\theta}_n$ has its probability mass.
This is also going to be a bad approximation if $g$ doesn't look like an affine function in the region where $\tilde{\boldsymbol{\theta}}_n$ has probability mass.
It may also be a bad approximation if $\tilde{\boldsymbol{\theta}}_n$ isn't normal.
As an example of the first problem, take:
$$g(x) = x^2 \quad \quad g'(x) = 2 x $$
If $\sqrt{n}\left( \tilde{\theta} - \mu \right) \xrightarrow{d} \mathcal{N}(0, 1)$
Applying the delta method you get

$$ \sqrt{n}\left( \tilde{\theta}^2 - \mu^2 \right) \xrightarrow{d} \mathcal{N}\left(0,\; 4\mu^2\right) $$

At $\mu = 0$ the variance $4\mu^2$ vanishes, so the first-order delta method collapses to a point mass at zero and says nothing about the shape of the limit. (In fact, at $\mu = 0$ the right scaling is $n\tilde{\theta}^2 \xrightarrow{d} \chi^2_1$, which a second-order delta method recovers.)
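Applying the first-order delta method here gives variance $(2\mu)^2$, which vanishes at $\mu = 0$. A simulation sketch of mine (using a sample mean of standard normals, so $\sqrt{n}(\tilde{\theta} - 0) \sim \mathcal{N}(0, 1)$ exactly) shows what actually happens there:

```python
import numpy as np

rng = np.random.default_rng(3)

n, reps = 500, 100_000
# theta is a sample mean of N(0, 1) draws: mu = 0 and sqrt(n)*theta ~ N(0, 1)
theta = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)

# First-order delta method scaling: sqrt(n)*(theta^2 - 0) degenerates to 0
print(np.var(np.sqrt(n) * theta**2))  # shrinks toward 0 as n grows

# The right scaling: n * theta^2 ~ chi-squared(1) (mean 1, variance 2)
z = n * theta**2
print(np.mean(z), np.var(z))
```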