I major in science, and my knowledge of statistics is rather superficial.

Problem

I had to find a data set and analyze it to the best of my ability as an assignment for my statistics course. It is no longer an active assignment; I just want to understand what was wrong with my analysis and what I should have done instead.

I used a categorical data set of employment rates in New Zealand, planning to arrange it in a 2x2 contingency table and use Pearson's chi-squared test and Fisher's exact test to test whether gender correlates with employment.

What I want to answer

  1. Understand why I cannot use the chi-squared test and Fisher's exact test for this problem, and learn what I should have used instead. "Odds ratio as a function of time", I assume? Any useful links on how to do that, preferably in R?
  2. Understand the "sequential correlation" comment regarding the first part of the assignment, and what exactly I should have done.

Way to help me #1 (shorter)

This is what the data look like (based on a census):

                 Male     Female
Employed      1201600    1060200
Unemployed      73300      75000

I ran a chi-squared test and Fisher's exact test in R, assuming that the resulting p-value would tell me the probability of such a distribution of jobs (or a more extreme one) given that the null hypothesis is true (that males and females have equal chances of getting a job). I got a very small p-value, and Fisher's test gave me an odds ratio of 1.16, which I took to mean that there is a correlation, and specifically that males are 16% more likely to find a job in NZ.
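For reference, the odds ratio quoted above can be reproduced with plain arithmetic. The following is a minimal Python sketch (not the R code from the assignment; the variable names are illustrative). It computes the sample cross-product odds ratio and, for comparison, the ratio of the two employment rates, since an odds ratio compares odds rather than probabilities.

```python
# Counts from the 2x2 table above.
male_emp, female_emp = 1_201_600, 1_060_200
male_unemp, female_unemp = 73_300, 75_000

# Odds of being employed within each gender, and their ratio.
odds_male = male_emp / male_unemp
odds_female = female_emp / female_unemp
odds_ratio = odds_male / odds_female

# Employment rates (probabilities) within each gender, and their ratio.
rate_male = male_emp / (male_emp + male_unemp)
rate_female = female_emp / (female_emp + female_unemp)
rate_ratio = rate_male / rate_female

print(f"odds ratio: {odds_ratio:.2f}")
print(f"employment rates: {rate_male:.3f} vs {rate_female:.3f}")
print(f"rate ratio: {rate_ratio:.3f}")
```

With these counts the odds ratio is about 1.16, while the employment rates are roughly 94.3% and 93.4%, a rate ratio of about 1.009.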

However, according to my lecturer, I used these tests inappropriately. I didn't quite understand why, but I think he was saying that these tests assume independence, and because there is a fixed number of jobs available in NZ, our samples are not independent. I'm not sure about this, though (his feedback is quoted below).

Way to help me #2 (longer)

If you have some spare time, I would appreciate it very much if you could look at the whole assignment. I will also provide the lecturer's feedback, so if you could interpret it for me, that would be great! The assignment is very easy for a mathematician or statistician: there are only two questions in it. The rest is padding where I tried to demonstrate that I know what I'm doing, and you can skip most of it.

Here's the link to a PDF file with the assignment I didn't succeed in: statistics assignment.pdf.

Lecturer's feedback

Your figure 1 exhibits sequential correlation which is the real reason why linear regression does not work. Neither fisher's test nor chi squared is good for your 2x2 table. This is because you want to test homogeneity, but you are rejecting the null because of non-independence (which is not interesting). The distinction between the two is irrelevant here (they are asymptotically identical in any case). You could have plotted the odds ratio as a function of time.

Th334
  • you could add the self-study tag – tomka Jan 11 '14 at 17:11
  • @tomka I disagree with the [tag:self-study] tag in this case and so have removed it. This question deals with actual data and concerns a genuine problem, not just a routine textbook situation. The criteria for the [tag:self-study] tag are not whether the question originates with classroom work but rather concern the nature of the question itself. Please visit the meta threads http://meta.stats.stackexchange.com/questions/1904 and http://meta.stats.stackexchange.com/questions/1172 for more information or to discuss this. – whuber Jan 11 '14 at 20:31
  • Are those employment numbers based on a census or a weighted survey file (i.e. a sample)? – probabilityislogic Jan 12 '14 at 12:03
  • @tomka and whuber, I actually don't mind, but this is not typical homework, if that's what you mean. It could just as well be a dissertation, in the sense that the only instructions were to collect data and analyze it. – Th334 Jan 13 '14 at 01:57
  • @probabilityislogic, good point, it's a census (small country). Does it affect the way we should approach the data? – Th334 Jan 13 '14 at 02:33

2 Answers

Some immediate responses:

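1) By "sequential correlation" your lecturer means that the data show autocorrelation: successive observations in time are correlated with each other. This makes the OLS estimates of the regression coefficients in simple linear regression inefficient. Whether ignoring it counts as a mistake depends on whether it was covered in your course.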

2) Maybe I do not understand the problem fully, but IMHO the chi-squared test of independence is used correctly here, except for two other issues:

3) Your chi-squared test has immense power because of the sample size. It is hard not to get a significant result even if the effect is very small. Furthermore, it appears you have a census of the population. In that situation statistical inference is unnecessary, because you observe all population units. But that's not what the lecturer remarks on.

4) You seem to aggregate the data across time points. You should actually test once per time point, since otherwise you aggregate effects over time (you count units multiple times). But that's also not what the lecturer remarks on.
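To illustrate point 3, here is a minimal Python sketch (an illustration only, not the R code from the assignment). It computes the Pearson chi-squared statistic without continuity correction for the table in the question, and then for a hypothetical table with exactly the same proportions but 1000 times fewer units. The function name `chi2_2x2` and the scaled-down table are assumptions made for the example. Note that R's `chisq.test` applies Yates' continuity correction to 2x2 tables by default, so its statistic would differ slightly.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Counts from the question: rows employed/unemployed, columns male/female.
table = (1_201_600, 1_060_200, 73_300, 75_000)

# Hypothetical: identical proportions, but a sample 1000 times smaller.
small = tuple(x / 1000 for x in table)

CRITICAL_5PCT = 3.841  # chi-squared critical value for 1 degree of freedom

chi2_full = chi2_2x2(*table)
chi2_small = chi2_2x2(*small)

print("full table significant at 5%:", chi2_full > CRITICAL_5PCT)
print("scaled-down table significant at 5%:", chi2_small > CRITICAL_5PCT)
```

The statistic scales linearly with the total count when the proportions are held fixed, so the same odds ratio that is overwhelmingly significant at census size falls below the 5% critical value in the scaled-down table.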

The lecturer actually remarks that you want to test the null of homogeneity, whereas you test the null of independence. So what does he mean by homogeneity?

I suppose he refers to the test of marginal homogeneity for paired data. This test is used to assess whether there was a change across time (repeated measures). This is, however, not what you wanted to assess in the first place. My guess is that he did not understand that you want to test whether gender and employment at time point x are related. Maybe he also tried to suggest that what you should test is change across time (or no change, in which case the repeated contingency tables would indeed be called homogeneous).
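If the aim is to look at change across time, one option in line with the lecturer's suggestion is to compute the odds ratio separately for each census and then plot the values against time. Below is a minimal Python sketch, assuming one 2x2 table per census is available. The names `odds_ratio` and `censuses` and the label "latest census" are placeholders, and only the table given in the question is filled in; the earlier tables would have to come from the actual census series.

```python
def odds_ratio(male_emp, female_emp, male_unemp, female_unemp):
    """Odds of employment for males divided by the odds for females."""
    return (male_emp / male_unemp) / (female_emp / female_unemp)

# One (label, counts) entry per census, in chronological order.
# Only the table from the question is known here; the label is a
# placeholder, and earlier censuses would be added as further entries.
censuses = [
    ("latest census", (1_201_600, 1_060_200, 73_300, 75_000)),
]

# One odds ratio per time point, instead of one aggregated table.
series = [(label, odds_ratio(*counts)) for label, counts in censuses]

for label, value in series:
    print(f"{label}: odds ratio = {value:.2f}")
```

The resulting (time point, odds ratio) pairs can then be plotted with any plotting tool; a flat line would suggest that the association between gender and employment has not changed over time.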

tomka
  • Could I get a quick description (or a link) of what autocorrelation is and how it leads to bias? 3) So any statistical test is inappropriate here because it's a census? How could I answer my question then? 4) Which test are you talking about: regression or chi-squared? For the latter I focused on the last data point only, the most recent census. – Th334 Jan 13 '14 at 03:24
  • @Herman 1) I made a mistake: the regression parameters will be inefficient, meaning that the OLS estimator is no longer the best estimator, i.e. its variance may be very large, leading to falsely insignificant tests. Maybe this is a start for some details: http://stats.stackexchange.com/questions/19321/can-i-trust-a-regression-if-variables-are-autocorrelated 3) Yes, if you observe all population units, there is no need for inference about population parameters, since you observe them without sampling error. 4) Chi-squared. In that case comment 4 does not apply. – tomka Jan 13 '14 at 09:14