Testing a 2x2 contingency table: male/female, employed/unemployed

Question

I major in science, and my knowledge of statistics is rather superficial.

Problem

I had to find a data set and analyze it to the best of my ability as an assignment for my statistics course. This is no longer an assignment; I just need help understanding what I did wrong in my analysis and what I should have done instead.

I used a categorical data set of employment rates in New Zealand, planning to arrange it in a 2x2 contingency table and use Pearson's chi-squared test and Fisher's exact test to test whether gender correlates with employment.

What I want to answer

Understand why I cannot use the chi-squared test and Fisher's exact test for this problem, and learn what I should have used instead. "Odds ratio as a function of time", I assume? Any useful links on how to do that, preferably in R?
Understand the "sequential correlation" comment regarding the first part of the assignment, and what exactly I should have done.

Way to help me #1 (shorter)

That's how our data looks (based on a census):

                 Male     Female
Employed      1201600    1060200
Unemployed      73300      75000

I did a chi-squared test and a Fisher's exact test in R, assuming that the obtained p-value would tell me the probability of such a distribution of jobs (or one more extreme) given that the null is true (that males and females have equal chances of getting a job). I got a very small p-value, and Fisher's test gave me an odds ratio of 1.16, meaning that there is a correlation, and specifically males are 16% more likely to find a job in NZ.
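For readers who want to check these figures, here is a minimal sketch in plain Python (standard library only; it is not the R code used in the assignment, and all variable names are illustrative). It computes the sample odds ratio and the Pearson chi-squared statistic without continuity correction. R's `chisq.test` applies a continuity correction to 2x2 tables by default and `fisher.test` reports a conditional maximum-likelihood estimate of the odds ratio, so the R output may differ slightly.

```python
import math

# Counts from the table above (rows: employed, unemployed; columns: male, female).
male_emp, female_emp = 1201600, 1060200
male_unemp, female_unemp = 73300, 75000

# Sample odds ratio: odds of employment for males divided by the odds for females.
odds_ratio = (male_emp / male_unemp) / (female_emp / female_unemp)

# Pearson chi-squared statistic for a 2x2 table (no continuity correction).
n = male_emp + female_emp + male_unemp + female_unemp
row_emp, row_unemp = male_emp + female_emp, male_unemp + female_unemp
col_male, col_female = male_emp + male_unemp, female_emp + female_unemp
chi2 = n * (male_emp * female_unemp - female_emp * male_unemp) ** 2 / (
    row_emp * row_unemp * col_male * col_female
)

# Upper-tail probability of a chi-squared variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"odds ratio = {odds_ratio:.3f}")
print(f"chi-squared = {chi2:.1f}, p-value = {p_value:.3g}")
```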

However, according to my lecturer, I used these tests inappropriately. I didn't quite understand why, but I think he was saying that these tests assume independence, and because there's a given amount of jobs available in NZ, our samples are not independent... I'm not sure about it though (you can see his feedback quoted below).

Way to help me #2 (longer)

If you have some spare time, I would appreciate it very much if you could look at the whole assignment. I will also provide the lecturer's feedback, so if you could interpret it for me, that would be great! The assignment is very easy for a mathematician or statistician. There are only two questions in it; the rest is padding where I tried to demonstrate that I know what I'm doing, and you can skip most of it.

Here's the link to a PDF file with the assignment I didn't succeed in: statistics assignment.pdf.

Lecturer's feedback

Your figure 1 exhibits sequential correlation which is the real reason why linear regression does not work. Neither fisher's test nor chi squared is good for your 2x2 table. This is because you want to test homogeneity, but you are rejecting the null because of non-independence (which is not interesting). The distinction between the two is irrelevant here (they are asymptotically identical in any case). You could have plotted the odds ratio as a function of time.

@tomka I disagree with the [tag:self-study] tag in this case and so have removed it. This question deals with actual data and concerns a genuine problem, not just a routine textbook situation. The criteria for the [tag:self-study] tag are not whether the question originates with classroom work but rather concern the nature of the question itself. Please visit the meta threads http://meta.stats.stackexchange.com/questions/1904 and http://meta.stats.stackexchange.com/questions/1172 for more information or to discuss this. — whuber, Jan 11 '14 at 20:31
Are those employment numbers based on a census or a weighted survey file (ie a sample)? — probabilityislogic, Jan 12 '14 at 12:03
@tomka and whuber, I actually don't mind, but this is not a typical homework, if that's what you mean. It could as well be a dissertation, in the sense that the only instructions were to collect data and analyze it. — Th334, Jan 13 '14 at 01:57
@probabilityislogic, good point, it's census (small country). Does it affect the way we should approach the data? — Th334, Jan 13 '14 at 02:33

tomka · Accepted Answer · 2014-01-13T09:14:51.877

Some immediate responses:

1) Your lecturer means that the data show autocorrelation. This leads to inefficient estimates of regression coefficients in simple linear regression. Whether overlooking this counts as a mistake depends on whether it was covered in your course.

2) Maybe I do not understand the problem fully, but IMAO the chi-squared test of independence is used correctly here, except for two other issues:

3) Your chi-squared test has immense power because of the sample size. It is hard not to get a significant result even if the effects are very small. Furthermore, it appears you have a census of the population. In this situation statistical inference is unnecessary, because you observe all population units. But that's not what the lecturer remarks.

4) You seem to aggregate the data across time points. You should actually test once per time point, since otherwise you aggregate effects over time (you count units multiple times). But that's also not what the lecturer remarks.
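Point 3 can be illustrated numerically. The sketch below uses plain Python with the standard library only. The helper `chi2_2x2` and the shrink factor of 1000 are illustrative assumptions, not part of the original analysis. It compares the two employment rates and shows what happens to the chi-squared statistic when the same proportions are observed in a hypothetical table 1000 times smaller.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2))

# Census counts from the question: employed (male, female), unemployed (male, female).
full = (1201600, 1060200, 73300, 75000)
# Hypothetical table with identical proportions but 1000 times fewer people.
small = tuple(x / 1000 for x in full)

rate_male = full[0] / (full[0] + full[2])
rate_female = full[1] / (full[1] + full[3])
chi2_full, p_full = chi2_2x2(*full)
chi2_small, p_small = chi2_2x2(*small)

print(f"employment rate: male {rate_male:.4f}, female {rate_female:.4f}")
print(f"full table:  chi2 = {chi2_full:.1f}, p = {p_full:.3g}")
print(f"small table: chi2 = {chi2_small:.3f}, p = {p_small:.3g}")
```

The employment rates differ by less than one percentage point, yet the full table rejects the null overwhelmingly, while the same proportions in the smaller table would not be significant at the 5% level.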

The lecturer actually remarks that you want to test the null of homogeneity, whereas you test the null of independence. So what does he mean by homogeneity?

I suppose he refers to the test of marginal homogeneity in paired test data. This test is used to assess whether there was a change across time (repeated measures). This is, however, not what you want to assess in the first place. My guess is that he did not understand that you want to test whether gender and employment at time point x are related. Maybe he also tried to suggest that what you should test is change across time (or no change, in which case the repeated contingency tables would indeed be called homogeneous).
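One way to act on point 4 and on the lecturer's suggestion is to compute the odds ratio separately for each time point and then plot it against time. Here is a minimal sketch in plain Python, assuming the data are held as one 2x2 table per period. The `tables` dictionary, its `"census"` key and the function name are hypothetical, and only the single table from the question is filled in because the per-period counts are not reproduced here.

```python
def odds_ratio(male_emp, female_emp, male_unemp, female_unemp):
    """Sample odds ratio of employment, males relative to females."""
    return (male_emp / male_unemp) / (female_emp / female_unemp)

# One entry per time point:
# (male employed, female employed, male unemployed, female unemployed).
tables = {
    "census": (1201600, 1060200, 73300, 75000),
    # add the remaining periods here, one 2x2 table each
}

# Odds ratio for each period, in insertion order; this is the series to plot.
series = {period: odds_ratio(*counts) for period, counts in tables.items()}

for period, value in series.items():
    print(f"{period}: odds ratio = {value:.3f}")
```

Keeping each period in its own table avoids counting the same people several times, which is what happens when the periods are pooled into one table.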

@Herman 1) I made a mistake: the regression parameters will be inefficient, meaning that the OLS estimator is no longer the best estimator, i.e. its variance may be very large, leading to falsely insignificant tests. Maybe this is a start for some details: http://stats.stackexchange.com/questions/19321/can-i-trust-a-regression-if-variables-are-autocorrelated 3) Yes, if you observe all population units, there is no need for inference about population parameters that you observe without sampling error. 4) Chi-squared. In that case comment 4 does not apply. — tomka, Jan 13 '14 at 09:14

score 1 · Answer 2 · answered Jan 12 '14 at 13:14

It is very opaque feedback - sounds to me like they're saying "you didn't do well this time - try harder next time". The only way to understand it is to be brave, and ask your lecturer for a meeting to discuss things further.

Your lecturer seems to be disappointed with your choice of research questions perhaps? I think they may have been looking for some "buzz words" like "auto-/serial-/correlation" "time series" "seasonal effects/adjustment" "business cycles" "trend". I don't know what you were expected to know when doing the assignment.

Anyway, here's what I think.

Your assignment shows a good ability to perform a statistical test, but from a data analysis perspective it shows a strange choice of examples. Analysis should be about telling a story. Personally I liked the choice of male vs female employment as a theme. However, I would have put the "second example" first, as it asks a simpler question: "is there a gender difference now?" After showing that there clearly is a difference (as you do), you could then have gone on to the more complex question of "has there been a consistent gender difference over time?" Of course this question may be beyond the scope of your "statistical toolbox" to answer in a formal manner. One way you could do this with linear regression is to model the odds of being employed vs unemployed (or the log-odds if this gives a better fit) for males and females. You then have a simple OLS model of

$$ y_i=\beta_0+\beta_1x_i +e_i $$

where $ y_i $ is the ratio "employed"/"unemployed", $ x_i $ is a dummy variable equal to one if the ratio is for males and zero otherwise, and $ e_i $ is the residual. You then test whether $\beta_1=0 $. You could take the model further and include a time covariate as well as an interaction between time and gender. This is all part of building your analysis work as a story ("the plot thickens", so to speak). This of course depends on knowing about multiple regression (which may be outside the course content).
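To make the notation concrete, here is a minimal sketch in plain Python (standard library only) that fits the model above by ordinary least squares. It uses only the single census table from the question, so there are just two observations, one odds value per gender. The function name `ols_fit` is illustrative. With two points the line fits exactly, so the sketch shows what $\beta_0$ and $\beta_1$ mean but cannot test $\beta_1=0$; that would need one pair of odds per time point.

```python
def ols_fit(x, y):
    """Closed-form OLS for y = b0 + b1 * x with a single regressor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# y: odds of being employed (employed / unemployed); x: 1 for males, 0 for females.
odds_male = 1201600 / 73300
odds_female = 1060200 / 75000
x = [1, 0]
y = [odds_male, odds_female]

b0, b1 = ols_fit(x, y)
print(f"beta_0 (odds for females)         = {b0:.3f}")
print(f"beta_1 (male odds minus female)   = {b1:.3f}")
```

With a dummy regressor the intercept is the female odds and the slope is the difference between the male and female odds, which is why testing $\beta_1=0$ amounts to testing for a gender difference.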

I wouldn't have used that first example at all; of course linear regression was inappropriate there. Your lecturer (probably) wants to see an example of a good use of linear regression. Of course, the OLS example I gave above may also not be appropriate; this depends on assessing the model.

@probabilityislogic, I'll tell you what I was supposed to know. In my two different statistics courses combined we covered, in varying degrees of detail, the following: bi(multi)nomial distribution, normal distribution, t.test, anova, chi-squared/fisher's exact, linear/logistic regression, hypergeometric distribution, Bayes's theorem, beta distribution. That's it. Did I have better tools to deal with my chosen question than the ones I used? — Th334, Jan 13 '14 at 03:05
@probabilityislogic, I don't quite understand how to do "linear regression to model the odds of being employed vs unemployed for males and females". Could you please either try explaining it using the numbers from my data, or show me R idioms, or link me to what I should read if you can, or suggest that I ask a new question? As far as theoretical equations go, all I understand is that in your example beta-0 is our intercept, beta-1 is our slope, x is our data, and e is some error... which is the same as saying that I understand nothing. How embarrassing, I'm sorry. — Th334, Jan 13 '14 at 03:43