I am looking to simulate the t-distribution from first principles.

In particular, I want to understand how the distribution arises by comparing the mean of sample A of size $n_A$ (taken from population A) with the mean of sample B of size $n_B$ (from a distinct population B). Note that $n_A$ and $n_B$ are not necessarily equal.

The null hypothesis ($H_0$) of a t-test (independent samples) states that both samples come from the same population. I'm interpreting this as population A essentially being the same as population B.

To simulate the t-distribution, this is my plan (Python):

  1. create a large (N=10000) array of normally distributed values with mean $10$ and standard deviation $2$. This serves as the single population, because the null hypothesis assumes that both samples come from the same underlying population.

  2. iterate 1000 times as follows:

    • take a random sample of 20 elements (sample A), and another 10 elements (sample B) from the underlying population
    • calculate the t-statistic for this realisation of samples
    • record the t-statistic from the $i^\mathrm{th}$ iteration
  3. plot a histogram of all 1000 t-statistic scores
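
A minimal sketch of the plan above, using only the Python standard library. Step 2b is filled in with the pooled-variance (equal-variance) statistic purely as a placeholder assumption, since the choice of formula is the open question here. The names `pooled_t` and `simulate_t`, the fixed seed, and the decision to draw 30 values at once and split them (so that samples A and B never share an element) are illustrative choices, not part of the original plan.

```python
import math
import random
import statistics


def pooled_t(a, b):
    """Pooled-variance two-sample t-statistic (assumed form for step 2b)."""
    n_a, n_b = len(a), len(b)
    nu = n_a + n_b - 2  # pooled degrees of freedom
    # Pooled sum of squared deviations: s_A^2 (n_A - 1) + s_B^2 (n_B - 1)
    ss = statistics.variance(a) * (n_a - 1) + statistics.variance(b) * (n_b - 1)
    std_err = math.sqrt((1 / n_a + 1 / n_b) * ss / nu)
    return (statistics.mean(a) - statistics.mean(b)) / std_err


def simulate_t(n_pop=10_000, n_iter=1_000, n_a=20, n_b=10, seed=0):
    """Steps 1 and 2 of the plan: returns one t-statistic per iteration."""
    rng = random.Random(seed)
    # Step 1: a single finite population, as H0 assumes one common source
    population = [rng.gauss(10, 2) for _ in range(n_pop)]
    t_values = []
    for _ in range(n_iter):
        # Step 2a: draw n_a + n_b values without replacement, then split
        draw = rng.sample(population, n_a + n_b)
        # Steps 2b and 2c: compute and record the statistic
        t_values.append(pooled_t(draw[:n_a], draw[n_a:]))
    return t_values


t_values = simulate_t()
```

For step 3, `t_values` can be handed to any histogram routine (for instance matplotlib's `plt.hist(t_values, bins=40)`). Under these assumptions the natural reference curve is a t-distribution with $n_A + n_B - 2 = 28$ degrees of freedom.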

However, point (2b) is where I have difficulty: what is the equation for the t-statistic? I have found various resources online (shown here slightly rearranged), but they don't appear to be entirely consistent.

Shoffma5 (slide 16) $$t = \frac{\mu_A - \mu_B}{\sqrt{ \frac{1/n_A+1/n_B}{\nu} }}\frac{1}{\sqrt{ s_A^2\big(n_A-1\big) + s_B^2\big(n_B-1\big) }}$$

ucdavis and statisticshowto $$t = \frac{\mu_A - \mu_B}{\sqrt{ \frac{1/n_A+1/n_B}{\nu}}}\frac{1}{\sqrt{ \Big(\sum A^2 - \frac{(\sum A)^2}{n_A}\Big)^2 + \Big(\sum B^2 - \frac{(\sum B)^2}{n_B}\Big)^2 }}$$

where $\mu_A$ and $\mu_B$ are the respective means of samples A and B, and $s_A^2$ and $s_B^2$ are the respective variances.
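
For orientation, the first expression can be regrouped into the pooled-variance form found in most textbooks. This is only a sketch: it assumes $\nu = n_A + n_B - 2$ (the pooled degrees of freedom, which the formulas above leave undefined) and introduces $s_p^2$ as shorthand for the pooled variance. $$s_p^2 = \frac{s_A^2\big(n_A-1\big) + s_B^2\big(n_B-1\big)}{\nu}, \qquad t = \frac{\mu_A - \mu_B}{s_p\sqrt{\frac{1}{n_A}+\frac{1}{n_B}}}$$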

What is the correct equation to use to calculate the t-statistic (independent samples)?

Ben
  • 1. While the two formulas should be algebraically equivalent, don't use the second formula, since it's numerically unstable. 2. Your simulation is somewhat flawed. The only way you can have an actual normally-distributed parent population is if that population is infinite. 3. When simulating populations with equal variances, there's little point doing simulations having $\sigma\neq 1$, since any other choice is equivalent to scaling both $\sigma$ and the difference in means (e.g. $\sigma=2$ and $\mu_1-\mu_2=4$ is the same as $\sigma=1$ and $\mu_1-\mu_2=2$ for a given sample size). – Glen_b Jun 24 '17 at 12:28
  • Thanks for the response Glen. 1 - can you expand on 'numerically unstable'? 2 - Sure, I get that the normal distribution is theoretical in nature since it is based on an infinite population; I was merely hoping to approximate infinity by taking a 'large' population which, for the illustrative purposes of the simulation, is good enough. 3 - fair point. – Ben Jun 24 '17 at 15:13
  • (in reverse order) 2. but it's much easier to do an exact infinite-population simulation quite directly. 1. See discussion in comments here:

    https://stats.stackexchange.com/questions/210483/whats-up-with-this-variance-computation

    See the answer here, which explains what the problem is:

    https://stats.stackexchange.com/questions/235004/is-it-possible-to-have-pearson-correlation-coefficient-values-1-or-values-1/235054

    See also some of the comments under this answer: https://stats.stackexchange.com/a/235143/805

    ...ctd

    – Glen_b Jun 25 '17 at 00:51
  • ctd... and these wikipedia links:

    http://en.wikipedia.org/wiki/Loss_of_significance and

    https://en.wikipedia.org/wiki/Variance#Formulae_for_the_variance ... you should not implement those mean of squares minus square of means forms on a computer. If you need a fast one-pass algorithm, you can find mentions of them at some of the above links (though a search for one pass variance should hit something)

    – Glen_b Jun 25 '17 at 00:51
  • Just going back to the question "what is the correct equation for the t-statistic?", can I point you in the direction of another of my questions? If you feel you could provide some input there, that would help me very much. See https://stats.stackexchange.com/questions/294725/so-many-ways-to-calculate-the-t-statistic-is-this-the-super-formula-i-need-t – Ben Jul 29 '17 at 09:47
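
The numerical-instability point raised in the comments can be illustrated with a small sketch. It compares the "sum of squares minus squared sum" form of the sample variance with a two-pass form that subtracts the mean before squaring. The function names and the test data (four values shifted by $10^9$) are illustrative assumptions, chosen so that the true sample variance is 30 in both cases.

```python
def variance_raw_sums(x):
    """Sample variance via sum(x^2) - (sum x)^2 / n (the unstable form)."""
    n = len(x)
    ss = sum(v * v for v in x) - sum(x) ** 2 / n
    return ss / (n - 1)


def variance_two_pass(x):
    """Sample variance via deviations from the mean (two passes)."""
    n = len(x)
    mean = sum(x) / n
    return sum((v - mean) ** 2 for v in x) / (n - 1)


small = [4.0, 7.0, 13.0, 16.0]
shifted = [v + 1e9 for v in small]  # same spread, much larger mean

for label, data in (("small", small), ("shifted", shifted)):
    print(label, variance_raw_sums(data), variance_two_pass(data))
```

On the unshifted data both forms agree. On the shifted data the raw-sums form subtracts two numbers near $4\times10^{18}$, where double precision can no longer resolve a difference of 90, so its result is far from 30, while the two-pass form is unaffected.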