
I have a data set on 5,699 PhD students. The first column, "Year", is the number of years it took the candidate to graduate with a PhD (1, 2, ..., 14); the second column, "Uni", is the university that awarded the PhD; and the third column, "Res", is the student's residency status (permanent or temporary).

> head(mydata)
   Year     Uni       Res
1    1 Berkeley Permanent
2    1 Berkeley Permanent
3    1 Berkeley Permanent
4    1 Berkeley Permanent
5    1 Berkeley Permanent
6    1 Berkeley Permanent

I wish to see whether there is a significant difference, by residency, in the number of years it took PhD students to graduate. I'm assuming I must perform a two-sample t-test or a two-sample z-test. I know that one performs a z-test if the population standard deviation is known, and a t-test if it is not. However, I have no information on whether these 5,699 students form a population or are a sample from a larger population. Since I am not sure, should I perform the two-sample t-test?

One of the assumptions of the two-sample t-test and the two-sample z-test is that the data must be normally distributed. Does this mean I have to check whether the number of years to graduate is normal within each group (permanent, temporary), or do I combine the data and check whether years to graduation is normal overall? What kind of tests do I use to check for normality in this case? Are there other assumptions I should be aware of?
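For concreteness, here is a minimal sketch of how such a comparison might be set up in R. The data frame `mydata_sim` is a hypothetical, simulated stand-in with the same column names as `mydata` (the "Other" university label and all values are made up), so the numbers it produces say nothing about the real data. It runs R's default Welch two-sample t-test, a rank-based alternative, and tabulates Year separately for each residency group.

```r
# Hypothetical stand-in for `mydata`: same column names, simulated values.
set.seed(1)
n <- 500
mydata_sim <- data.frame(
  Year = sample(1:14, n, replace = TRUE),
  Uni  = sample(c("Berkeley", "Other"), n, replace = TRUE),
  Res  = factor(sample(c("Permanent", "Temporary"), n, replace = TRUE))
)

# Group size, mean and standard deviation of Year by residency.
group_summary <- aggregate(
  Year ~ Res, data = mydata_sim,
  FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x))
)

# Welch two-sample t-test (the default of t.test: var.equal = FALSE).
welch <- t.test(Year ~ Res, data = mydata_sim)

# Rank-based alternative; exact = FALSE because Year has many ties.
rank_test <- wilcox.test(Year ~ Res, data = mydata_sim, exact = FALSE)

# Distribution of Year within each group, one row per residency status.
year_by_res <- table(mydata_sim$Res, mydata_sim$Year)

print(group_summary)
print(welch)
print(rank_test)
print(year_by_res)
```

With the real data, replacing `mydata_sim` by `mydata` should be enough, assuming "Res" has exactly two levels.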

John
  • Whether a group forms a sample or a population depends crucially on the intended scope of your inferences: are you aiming to describe these 5699 people or will you attempt to derive conclusions about a larger group of people from these data? BTW, neither test assumes data are normally distributed. You would benefit from reading some of our high-voted threads about these tests. – whuber Oct 10 '19 at 12:34
  • I'm wondering about the range and discreteness of the dependent variable (Year). If there are only a few values this variable can take, you might need an analysis appropriate for ordinal dependent variables... or maybe one appropriate for count variables. – Sal Mangiafico Oct 10 '19 at 12:51
  • @whuber: Doesn't t-test assume normality? The sample statistic in t-test follows t-distribution under the assumption of normality isn't it? Asymptotically it will follow normal but to get the sampling distribution isn't normality a required assumption? – Dayne Oct 11 '19 at 06:55
  • @Dayne The t-test assumes the sampling distribution of the statistic is sufficiently close to a Student t to permit the use of the Student t CDF in converting the t-statistic to a p-value for t-statistics close to the critical value. These italicized conditions tend to hold for many underlying non-Normal distributions and typical critical values (like 5% and 1%), especially those distributions that aren't very skewed. – whuber Oct 11 '19 at 14:01
  • @whuber can you please cite some good source for this? Also, you say t-test assumes this. I think this may be a result found later. The original does assume normality, afaik. – Dayne Oct 11 '19 at 16:26
  • @Dayne I can't remember where I saw the clearest account, but Chapter 1 of Mosteller & Tukey (1977), Data Analysis and Regression, has a nice discussion of the issues. – whuber Oct 11 '19 at 17:47
  • @whuber thanks for the reference, but I am curious how is sufficiently close defined in this context. – Dayne Oct 14 '19 at 09:31
  • @Dayne It depends on your needs. You could ask similar questions of any quantitative theory: for instance, Newtonian mechanics is perfectly fine except at really tiny distances or relativistic speeds. How tiny or how fast? It depends on how accurate you need the results to be. – whuber Oct 14 '19 at 12:14
  • Ok. So I was thinking (also hoping) that there is some theoretical underpinning to this. But if it is that for many non-normal distributions the t-test gives an approximately correct answer, then it's a different story. I guess, in principle, we should continue to say that the t-test assumes normality. – Dayne Oct 14 '19 at 13:25
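The point debated in the comments, whether the t-statistic is "sufficiently close" to a Student t for non-normal data, can be probed by simulation. The sketch below is only an illustration under made-up assumptions: `draw_years` is a hypothetical discrete, right-skewed distribution on 1 to 14, and the group size of 200 is arbitrary. Both groups are drawn from the same distribution, so the share of p-values below 0.05 estimates the test's actual type I error rate; a value near 0.05 would suggest the approximation is adequate in that setting.

```r
# Simulation sketch: how often does the Welch t-test reject at the 5% level
# when both groups come from the SAME discrete, right-skewed distribution?
set.seed(42)
n_sims <- 2000
n_per_group <- 200  # hypothetical group size

# Hypothetical distribution of years to graduation, restricted to 1..14.
draw_years <- function(n) pmin(1 + rpois(n, lambda = 4), 14)

# One p-value per simulated pair of groups.
p_values <- replicate(n_sims, {
  t.test(draw_years(n_per_group), draw_years(n_per_group))$p.value
})

# Estimated type I error rate at the nominal 5% level.
rejection_rate <- mean(p_values < 0.05)
print(rejection_rate)
```

Changing `draw_years` or `n_per_group` shows how the answer depends on skewness and sample size, which is one way to make "it depends on your needs" concrete.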
