  1. Suppose we have observations $x_1, x_2, \ldots, x_n$ where $n$ is very large. We standardize the observations as $$y_i=\frac{x_i-\bar{x}}{\frac{s}{\sqrt{n}}},$$ where $s=\frac{\sum\limits_{i=1}^n(x_i-\bar{x})^2}{n-1}$. Can we say that, for $n$ sufficiently large, the $y_i$'s are approximately iid Student's $t$ random variables with $n-1$ degrees of freedom?

  2. Suppose we divide our observations into $k$ groups and for $j=1,2,\ldots,k$ we define $$y_j=\frac{\bar{x}_j-\bar{x}}{\frac{s_j}{\sqrt{n/k}}},$$ where $\bar{x}_j$ and $s_j$ are the mean and standard deviation, respectively, of the observations in the $j$th group and $\bar{x}$ is the sample mean of all the observations. Can we say that, for $n$ and $\frac{n}{k}$ very large, the $y_j$'s are approximately iid Student's $t$ random variables with $\frac{n}{k}-1$ degrees of freedom?

User1865345
  • Only under restrictive conditions: you need the underlying distribution not to be too skewed. For a real example of what can happen, and how large $n$ could potentially be, see https://stats.stackexchange.com/questions/69898. There are situations where no value of $n$ will work: see https://stats.stackexchange.com/questions/375208 for instance. – whuber Oct 18 '23 at 15:58
  • @whuber if the underlying distribution is not too skewed, are (1) and (2) both correct? – user771946 Oct 19 '23 at 06:06
  • Technically, no. The second link I gave explains how the Cauchy distribution will cause problems -- and by any definition of "skewed," that distribution is not skewed at all. – whuber Oct 19 '23 at 12:10

1 Answer

  1. Suppose $x_1,\dots,x_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2$ (presumably $s$ should be the sample standard deviation $\sqrt{\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2}{n-1}}$ rather than the variance as written). We have $$y_i=\frac{x_i-\bar{x}}{\frac{s}{\sqrt{n}}} = \frac{x_i-\mu}{\frac{s}{\sqrt{n}}} - \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}.$$

For the second component, by Slutsky's theorem, $$ \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}\overset{d}{\rightarrow} N(0,1), $$ where we use the fact that $s^2\overset{p}{\rightarrow}\sigma^2$ as $n\rightarrow\infty$.

For the first component, for any fixed $i$ with $x_i\neq\mu$, $$\left|\frac{\sqrt{n}(x_i-\mu)}{s}\right|\rightarrow\infty,$$ as $n\rightarrow\infty$ (think about the example where $\mu=0$ and $s=1$, so we are multiplying a fixed random variable by $\sqrt{n}$). In conclusion, $y_i$ diverges in magnitude instead of converging in distribution to $t(n-1)$; recall that $t(n-1)$ gets close to $N(0,1)$ as $n\rightarrow\infty$.
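The divergence can also be seen without any asymptotics. The following is a minimal sketch, assuming $s$ is the sample standard deviation and using standard normal data purely as an illustrative choice; `standardize_individual` is a hypothetical helper name. Because each $y_i$ is $\sqrt{n}$ times an ordinary $z$-score, the sample standard deviation of the $y_i$'s equals $\sqrt{n}$ by construction, whatever the data are.

```python
import math
import random
import statistics


def standardize_individual(xs):
    """Return y_i = (x_i - xbar) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(xs)
    xbar = statistics.fmean(xs)
    s = statistics.stdev(xs)  # n - 1 denominator
    return [(x - xbar) / (s / math.sqrt(n)) for x in xs]


random.seed(0)  # illustrative seed
for n in (100, 10_000):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # assumed data model
    ys = standardize_individual(xs)
    # Spread of the y_i next to sqrt(n): the two agree up to rounding error.
    print(n, statistics.stdev(ys), math.sqrt(n))
```

So the spread of the $y_i$'s grows without bound in $n$, which is incompatible with an approximately $t(n-1)$ or $N(0,1)$ shape.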

2. We also have $$y_j=\frac{\bar{x}_j-\bar{x}}{\frac{s_j}{\sqrt{n / k}}} = \frac{\bar{x}_j-\mu}{\frac{s_j}{\sqrt{n / k}}} - \frac{\bar{x}-\mu}{\frac{s_j}{\sqrt{n / k}}}.$$ By Slutsky's theorem, $$ \frac{\bar{x}_j-\mu}{\frac{s_j}{\sqrt{n / k}}}\overset{d}{\rightarrow} N(0,1), $$ as $n / k\rightarrow\infty$ (namely, we need $n\rightarrow\infty$ with $k$ not increasing too fast).

For the second component, if $k\rightarrow\infty$, $$ \sqrt{n / k}(\bar{x}-\mu) = \sqrt{n}(\bar{x}-\mu)/\sqrt{k}\overset{p}{\rightarrow} 0. $$ (This is because, by the CLT, $\sqrt{n}(\bar{x}-\mu)$ converges in distribution, while the denominator $\sqrt{k}\rightarrow\infty$. Dividing by $s_j$ does not change the limit, since $s_j\overset{p}{\rightarrow}\sigma$.)

In this way, $y_j\overset{d}{\rightarrow}N(0,1)$, which is very close to $t(n/k-1)$ when $n/k$ is large, provided that the number of groups $k$ goes to infinity at a slower rate than $n$ (in the sense that $n/k\rightarrow\infty$).
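As a rough numerical check of this limit, here is a minimal simulation sketch. The exponential data, the sizes `n = 80_000` and `k = 400`, the seed, and the helper name `group_statistics` are all assumptions chosen for illustration; the sketch also assumes $k$ divides $n$ so that every group has exactly $n/k$ observations.

```python
import math
import random
import statistics


def group_statistics(xs, k):
    """Return y_j = (xbar_j - xbar) / (s_j / sqrt(n/k)) for k equal-sized groups."""
    n = len(xs)
    m = n // k  # group size n/k; assumes k divides n
    xbar = statistics.fmean(xs)
    ys = []
    for j in range(k):
        group = xs[j * m:(j + 1) * m]
        s_j = statistics.stdev(group)  # n - 1 denominator within the group
        ys.append((statistics.fmean(group) - xbar) / (s_j / math.sqrt(m)))
    return ys


random.seed(1)  # illustrative seed
n, k = 80_000, 400
xs = [random.expovariate(1.0) for _ in range(n)]  # skewed data with finite variance
ys = group_statistics(xs, k)
# If the N(0,1) approximation is reasonable, the mean should be near 0 and the
# standard deviation near 1.
print(statistics.fmean(ys), statistics.stdev(ys))
```

A histogram or normal Q-Q plot of `ys` would be a more informative check than two summary numbers. Heavy-tailed inputs such as the Cauchy case mentioned in the comments violate the finite-variance assumption and should not be expected to behave this way.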

Additional notes on approximating the distribution of $y_j$:

Requirement 1: $k$ is sufficiently large. This is because we want the second component $\frac{\bar{x}-\mu}{\frac{s_j}{\sqrt{n / k}}}=\frac{\sqrt{n}(\bar{x}-\mu)}{s_j}\cdot \frac{1}{\sqrt{k}}$ to be close to $0$. If we assume $\frac{\sqrt{n}(\bar{x}-\mu)}{s_j}$ is approximately $N(0,1)$, then by the 68-95-99.7 rule for the normal distribution, the second component lies within $[-3/\sqrt{k}, 3/\sqrt{k}]$ with probability about 0.997. As a rough guide, we would probably require $k$ to be at least 100. But this depends on the problem, and you may also quantify the effect of this term directly through the interval $[-3/\sqrt{k}, 3/\sqrt{k}]$ (instead of assuming it is approximately zero).
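To make the size of this term concrete, the half-width $3/\sqrt{k}$ can be tabulated for a few values of $k$. This is only arithmetic on the interval above; `second_term_bound` is a hypothetical helper name and the listed values of $k$ are arbitrary.

```python
import math


def second_term_bound(k, z=3.0):
    """Half-width z / sqrt(k) of the interval that contains the second component
    with high probability, under the N(0,1) approximation described above."""
    return z / math.sqrt(k)


for k in (10, 100, 1_000, 10_000):
    print(k, round(second_term_bound(k), 3))
```

For $k=100$ the half-width is $0.3$, which is not negligible next to a unit-variance first component, so reporting the interval rather than treating the term as zero may be the safer choice.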

Requirement 2: $n/k$ is sufficiently large. This is needed in order to apply the CLT to $\frac{\bar{x}_j-\mu}{\frac{s_j}{\sqrt{n / k}}}$. A common rule of thumb asks for $n/k$ to be at least 30, but this also depends on the distribution of the $x$'s. Also, if you believe $x_1,\dots,x_n$ are sampled from a normal distribution, then exactly $$ \frac{\bar{x}_j-\mu}{\frac{s_j}{\sqrt{n / k}}}\sim t(n / k-1), $$ which holds for any group size $n/k\geq 2$, with no large-sample requirement.
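The exact-$t$ statement for normal data can be checked by simulation even with very small groups. The sketch below assumes standard normal data, a group size of 5, and 20,000 replications, all illustrative choices; `t_statistic` is a hypothetical helper. It uses the standard two-sided 5% critical value of $t(4)$, which is about 2.776.

```python
import math
import random
import statistics


def t_statistic(group, mu):
    """Return (xbar_j - mu) / (s_j / sqrt(m)) for one group of size m."""
    m = len(group)
    return (statistics.fmean(group) - mu) / (statistics.stdev(group) / math.sqrt(m))


random.seed(2)  # illustrative seed
m, reps = 5, 20_000
crit = 2.776  # two-sided 5% critical value of t with m - 1 = 4 degrees of freedom
exceed = sum(
    abs(t_statistic([random.gauss(0.0, 1.0) for _ in range(m)], 0.0)) > crit
    for _ in range(reps)
)
frac = exceed / reps
# Under t(4) this fraction should be close to 0.05; under N(0,1) it would be
# roughly ten times smaller.
print(frac)
```

The gap between the two tail probabilities shows why the $t$ reference distribution matters when $n/k$ is small, even though it is nearly indistinguishable from $N(0,1)$ once $n/k$ is large.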