Example from Wikipedia:

Suppose that in a particular geographic region, the mean and standard deviation of scores on a reading test are 100 points, and 12 points, respectively. Our interest is in the scores of 55 students in a particular school who received a mean score of 96. We can ask whether this mean score is significantly lower than the regional mean—that is, are the students in this school comparable to a simple random sample of 55 students from the region as a whole, or are their scores surprisingly low?

They then use the population $\sigma$ as the standard deviation when computing the standard error:

$\mathrm{SE} = \frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{55}} \approx \frac{12}{7.42} \approx 1.62$
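For reference, the standard error can be carried through to a z-statistic and a lower-tailed p-value. This is a minimal R sketch that uses only the numbers quoted above; the variable names are my own.

```r
mu    <- 100  # regional mean
sigma <- 12   # regional standard deviation, treated as known
n     <- 55   # number of students in the school
xbar  <- 96   # observed school mean

se <- sigma / sqrt(n)   # standard error of the mean
z  <- (xbar - mu) / se  # z-statistic
p_lower <- pnorm(z)     # P(Z <= z), lower-tailed p-value

round(c(se = se, z = z, p_lower = p_lower), 4)
```

The z-statistic comes out near -2.47, so the lower-tailed p-value is below 0.01, but only if $\sigma = 12$ really applies to this school.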

This particular school may have a higher variance in scores (for example, half of the students are from poor families and have to work instead of studying). Why do they (here and in other textbooks) assume that its variance equals the population variance?
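To make the concern concrete, here is a rough simulation sketch. It assumes a hypothetical school whose true mean is still 100 but whose true SD is 18 (my own made-up value), and checks how often a lower-tailed Z-test that plugs in $\sigma = 12$ rejects at the 5% level.

```r
set.seed(1)
n            <- 55
mu           <- 100
sigma_pop    <- 12  # SD assumed by the Z-test
sigma_school <- 18  # hypothetical true SD of the school (assumption)

reject <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma_school)      # no difference in means
  z <- (mean(x) - mu) / (sigma_pop / sqrt(n))      # Z-test with the "wrong" sigma
  z < qnorm(0.05)                                  # lower-tailed test at 5%
})

rejection_rate <- mean(reject)
# Theoretical rate under these assumptions: z is N(0, (18/12)^2)
theory <- pnorm(qnorm(0.05) * sigma_pop / sigma_school)
c(simulated = rejection_rate, theoretical = theory)
```

Under these assumed numbers the rejection rate is well above the nominal 5%, even though the means are identical.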

Another example: Example #2. We accept that the population and sample means may differ, but we do not allow the variances to differ. You can also see it directly here. Isn't this misleading?

Yes, I know that we should use a one-sample t-test, and I am fine with computing the variance using the known true mean under $H_0$, but my question is about the Z-test, which is usually taught before the t-test.
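As a sketch of the alternatives mentioned here, the following R code compares three statistics on one simulated sample: the Z-test with the population $\sigma$, the one-sample t-test, and a variant that estimates the SD around the known mean under $H_0$. The school's parameters (mean 96, SD 18) are hypothetical values chosen only for illustration.

```r
set.seed(42)
mu0       <- 100
sigma_pop <- 12
n         <- 55
x <- rnorm(n, mean = 96, sd = 18)  # hypothetical school sample (assumption)

# Z-test: plugs in the population SD
z_stat <- (mean(x) - mu0) / (sigma_pop / sqrt(n))

# One-sample t-test: uses the sample SD
t_stat <- unname(t.test(x, mu = mu0, alternative = "less")$statistic)

# Variant: SD estimated around the known mean under H0
s0      <- sqrt(mean((x - mu0)^2))
z0_stat <- (mean(x) - mu0) / (s0 / sqrt(n))

c(z = z_stat, t = t_stat, z0 = z0_stat)
```

When the sample SD is larger than 12, the Z-statistic is larger in magnitude than the t-statistic, which is exactly where the two tests can disagree.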

UPD Probable answer: It seems that if the population has $SD = x$ and you subsample from a subpopulation with $SD \neq x$, then the population is not normally distributed; it is a mixture of normal distributions (even if the means are the same). Correct me if I am wrong. Another issue is that we do not care whether the population is normally distributed; we care whether the subsample is normally distributed (if we want to make the Z-test stronger). But that is another story and has nothing to do with the assumptions of the original Z-test...
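The mixture claim can be checked numerically. The sketch below assumes a hypothetical population made of two equally sized groups with the same mean (100) and SDs of 12 and 18, and compares its kurtosis with that of a single normal distribution, for which the kurtosis is 3.

```r
set.seed(1)
m <- 100000

# Hypothetical 50/50 mixture: same mean, different SDs (assumption)
mix  <- c(rnorm(m, mean = 100, sd = 12), rnorm(m, mean = 100, sd = 18))
pure <- rnorm(2 * m, mean = 100, sd = 12)

# Moment-based kurtosis (equals 3 for a normal distribution)
kurtosis <- function(v) mean((v - mean(v))^4) / mean((v - mean(v))^2)^2

k_mix  <- kurtosis(mix)
k_pure <- kurtosis(pure)
c(mixture = k_mix, single_normal = k_pure)
```

The mixture has heavier tails than a normal distribution (kurtosis above 3), which is consistent with the statement that equal means alone do not make the pooled population normal.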

UPD: Wrong again. Suppose we subsample people who did not finish school and have low scores on the reading test.

scores <- rnorm(10000, mean = 100, sd = 12)  # simulated population scores
region_scores <- scores[scores < 90]         # keep only the low scorers

The distribution of the region scores is no longer normal, so we cannot apply the Z-test.
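A quick way to see what the truncation does is to look at the mean, SD, and skewness of the subsample. This sketch repeats the simulation with a fixed seed (my addition) and uses a hand-written moment-based skewness helper.

```r
set.seed(1)
scores <- rnorm(10000, mean = 100, sd = 12)
region_scores <- scores[scores < 90]  # truncated from above at 90

# Moment-based skewness (0 for a symmetric distribution)
skewness <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5

region_summary <- c(
  mean = mean(region_scores),
  sd   = sd(region_scores),
  skew = skewness(region_scores)
)
region_summary
```

The subsample is left-skewed, and both its mean and its SD differ from the population values of 100 and 12, so neither the normality assumption nor $\sigma = 12$ holds for it.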

It seems that we can never apply the Z-test. Or, if we satisfy the test's assumptions and sample from the population randomly, we will have nothing interesting to test - our subsample will deviate in mean randomly, and no true effect will arise.
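The "no true effect" case can be simulated directly. In this sketch every sample really is a simple random sample of 55 from a normal population with mean 100 and SD 12, so any rejection of a 5% lower-tailed Z-test is a false positive.

```r
set.seed(1)
n     <- 55
mu    <- 100
sigma <- 12

reject_h0 <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)     # random sample from the population
  z <- (mean(x) - mu) / (sigma / sqrt(n))
  z < qnorm(0.05)                          # lower-tailed test at 5%
})

false_positive_rate <- mean(reject_h0)
false_positive_rate
```

The rejection rate stays close to 5%: the sample mean deviates only at random, as described above.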

Test assumptions are described here. Why do they want to estimate deviation in means with random sampling from a normal population? There will be no true difference in means! I'm totally confused.

Glorfindel
German Demidov
  • Good anticipation of a possible problem! You can let your estimates of the variance be different for the poor and the rich. This would allow you to estimate what's known as Heteroscedasticity-consistent standard errors. – Matthew Gunn May 06 '16 at 10:58
  • Thanks for the answer. Yes, I will explore the data first and test for a difference in variance, and (hopefully) my test will be powerful enough to detect it. But the question is: why do they teach it this way? Is it right to say that all the textbooks are basically wrong? What is the motivation for this assumption - simplicity's sake? Can I say "I do not recommend using the one-sample Z-test in practice, even for large sample sizes"? – German Demidov May 06 '16 at 11:12
  • And another question: how do I test for a difference in variances when I have a population of 1 million and a sample of 40 observations? Standard tests with 999999 degrees of freedom look kind of silly. – German Demidov May 06 '16 at 11:14
  • I think this is a very interesting question - my gut feeling is that the answer is rooted in historical developments, as well as the natural progression in the academic syllabus. Also in the ease of defaulting to simpler models. Clearly in the majority of the real-life situations the t test is more realistic, because we don't know $\sigma$. Even more honest is to assume that the populations $\sigma$'s can differ. – Antoni Parellada May 06 '16 at 12:52
  • It would be interesting to get a canonical answer, although I think that discussing the Behrens-Fisher problem would be pertinent. – Antoni Parellada May 06 '16 at 12:54
  • It could be solved easily - just by adding one more assumption to the test: sample sigma = population sigma. But nobody (to my knowledge) does this - they just use this assumption without stating it explicitly. So I am still not sure whether I missed something important that makes this assumption unnecessary. – German Demidov May 06 '16 at 13:13
  • Actually in wikipedia they talk a bit about the estimation of sigma, and I agree with the beginning of wiki's article. However, other sources (and the example from wiki) just use sigma from population. =( So the article seems contradictory. – German Demidov May 06 '16 at 13:17
  • To compute the S.E., it is recommended to use the population S.D. The sample S.D. or corrected S.D. is used when the population S.D. (sigma) is not available. Especially when we have a large-sample estimate of the S.D., there is no problem in assuming its equivalence with the population S.D. –  May 06 '16 at 13:40
  • But why is it recommended? The sample SD can be different from the population SD. We assume that the mean of the sample can be different from the population mean, but why are we sure that the sample SD equals the population SD? – German Demidov May 06 '16 at 13:48
  • @German Demidov It is an unbiased estimate, being based on the population. The sample mean may in some cases differ from the population estimate because of a small sample or a constant bias. The S.D. estimate is calculated in such a way that bias or measurement error is neutralised. –  May 06 '16 at 14:27
  • @subhashc.davar But what if you measured height in the population as 175 with SD = 10, and then took a subsample from a region where the average height is 170 and SD = 15? Will the population SD be an unbiased estimate of the sample SD? I think I got the trick with this, but your explanation will probably be different. – German Demidov May 06 '16 at 14:37
  • not clearly stated. It is vague. –  May 06 '16 at 14:42
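
On the comment about testing a sample of 40 against a known population SD: one standard option is the one-sample chi-square test for a variance, which has $n - 1 = 39$ degrees of freedom and does not involve the population size at all. The sketch below is only an illustration; it assumes the sample is normally distributed and reuses the height numbers from the comments (population SD 10, regional mean 170, regional SD 15) to simulate data.

```r
set.seed(1)
sigma0 <- 10                         # population SD, treated as known
n      <- 40
x <- rnorm(n, mean = 170, sd = 15)   # simulated regional sample (assumption)

# Chi-square statistic for H0: Var(x) = sigma0^2
chi_stat <- (n - 1) * var(x) / sigma0^2
df       <- n - 1

# Two-sided p-value, capped at 1
p_value <- min(1, 2 * min(pchisq(chi_stat, df), 1 - pchisq(chi_stat, df)))

c(chi_stat = chi_stat, df = df, p_value = p_value)
```

This test is known to be sensitive to departures from normality, so it is a sketch of the idea rather than a recommendation.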

0 Answers