
I have a large survey in which students were asked, among other things, their mother's level of education. Some skipped the question, and some answered it incorrectly. I know this because, for a sub-sample of the initial respondents, the mothers were later interviewed and asked the same question. (I'm sure there is some smaller amount of error associated with the mothers' responses as well.)

My challenge is to decide how best to take advantage of this second, more reliable source of data. At the very least I can use it to impute missing data more intelligently than I could by relying on complete cases alone. But if 3/4 of the children whose data I can cross-check, and who respond "My mother never finished elementary school", are contradicting their mother's answer, then it would seem I should use imputation to create multiple datasets that capture the uncertainty there. [added: I said 3/4 to make a point, but now that I've checked the data I might as well tell you that closer to 40% are discrepant]

I will personally be using mother's education as a predictor in a mixed model, but if anyone has something to say about other situations, I'd love to learn about them as well.

I would love to receive advice in broad strokes or in specifics. Thank you!

Update: I'm leaving the question open for now. Though I appreciate Will's and Conjugate_Prior's responses, I'm holding out hope for more specific and technical feedback.

The scatterplot below will give you an idea of how the two variables are related in the roughly 10,000 cases where both exist. The students are nested in more than 100 schools. The two answers correlate at 0.78 (student's answer: mean = 5.12, s.d. = 2.05; mother's answer: mean = 5.02, s.d. = 1.92). The student's answer is missing in about 15% of cases.

[Scatterplot: student's answer vs. mother's answer for the roughly 10,000 cases where both exist]
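For concreteness, the kind of cross-check described above might look like the following minimal pandas sketch. The file name and the columns student_edu and mother_edu are hypothetical stand-ins for the real survey coding.

```python
import pandas as pd

# Hypothetical file and column names; the real survey coding will differ.
df = pd.read_csv("survey.csv")

# Overlap sample: cases where both the student's and the mother's answers exist
both = df.dropna(subset=["student_edu", "mother_edu"])

print("overlap n:       ", len(both))
print("correlation:     ", both["student_edu"].corr(both["mother_edu"]))
print("discrepancy rate:", (both["student_edu"] != both["mother_edu"]).mean())
print("student missing: ", df["student_edu"].isna().mean())
```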

  • Out of curiosity, was the first response option to that education question "My mother never finished elementary school"? If so, I would be worried about the accuracy of the rest of your test results for those test-takers. – Michelle Mar 21 '12 at 23:46
  • "How far did she go in school?" - 1) Eighth grade or less – Michael Bishop Mar 22 '12 at 03:16
  • You probably have a subset of test-takers who ticked the first response option to each question. Can you check that? – Michelle Mar 22 '12 at 06:25
  • That plot is very insightful. It looks fairly symmetric which is not what you'd expect if in fact a bunch of kids just ticked off the first answer. If that were the case then the cases would tend to cluster along the bottom row. Of course 'looking' symmetric doesn't actually guarantee it is but it's a nice start. The strong correlation you observe between mother and child response is also consistent with this. – Will Mar 22 '12 at 14:22
  • After further consideration, I believe what I would do is run the model on the subsample, then repeat on the full sample using the mother's response when available. See if the results from the subsample and full sample differ at all. Maybe consider including a dummy in the full sample that is coded 1 when the child's response is used. It's not the fanciest solution, but I suspect the error rate is minimal and, importantly, more or less random. Before thinking up fancy solutions to the problem, let's see if the problem needs to be 'dealt with' in the first place. – Will Mar 22 '12 at 14:30
  • @Will, I agree the problem isn't that bad and I could do something simple as you suggest, but this data is used by hundreds of people, so fairly small improvements justify more effort than one would expect. – Michael Bishop Mar 22 '12 at 14:37
  • @MichaelBishop, hmm. That is a conundrum then. I suppose the best approach would be to create multiple datasets to capture the variance in the child's guess. I'm not going to wager a guess as to how to do this in a statistically correct/acceptable manner. I would like to see someone who knows more than I do make a suggestion though! – Will Mar 22 '12 at 15:31
  • @Will, yes, that's exactly multiple imputation you're describing. Hence I already did 'wager a guess' and discussed the statistical issues in my answer. Apparently not very clearly, though. – conjugateprior Mar 22 '12 at 21:39
  • @MichaelBishop What 'specific and technical' feedback are you waiting for? Introduction, backing theory or software tools for multiple imputation maybe? Or how to use multiply imputed data in analyses? I'm happy to suggest things but you'd have to say what you want. – conjugateprior Mar 22 '12 at 21:44
  • @ConjugatePrior, yes I understand the concept and have used the method. I'm saying I've not used imputation to 'impute' values that aren't actually missing. It's not a situation I've encountered personally and, frankly, I'd be a little reluctant to replace 'real' data (however flawed) with something created in a statistician's laboratory. But I'd still be interested to see how it's done - particularly using Stata's mi command. But I'm not the one asking this question ;) – Will Mar 23 '12 at 02:39
  • Ahh. I see. Then I'd also be (more than a little) reluctant to impute existing data and would recommend it wasn't done at all, despite this sort of argument: http://gking.harvard.edu/gking/files/measure.pdf – conjugateprior Mar 23 '12 at 08:46
  • @ConjugatePrior, +1 for that link. Basically it sounds like you're opposed to attempting the project I asked about (correcting for error in survey response when you have two sources of data). That's fine, I really do appreciate you and Will giving your thoughts even if they are focused on more general issues. – Michael Bishop Mar 23 '12 at 18:03

2 Answers


The first thing to note is that your variables are: "what student said about mother's education" and "what student's mother said about student's mother's education". Call them S and M respectively, and label the unobserved true level of mother's education as T.

S and M both have missing values, and there is nothing wrong (modulo the observation below) with putting both M and S in the imputation model but only using one of them in the subsequent analysis. The other way around (using a variable in the analysis that was left out of the imputation model) would always be inadvisable.

This is separate from three other questions:

  1. Does a missing value mean the students don't know or don't want to say that much about their mothers?
  2. How to use S and M to learn about T?
  3. Do you have the right sort of missingness to allow multiple imputation to work?

Ignorance and missingness

You might be interested in T, but you need not be: perceptions of educational attainment (via S, and possibly M) or lack of student knowledge might be more causally interesting than T itself. Imputation may be a sensible route for the first, but may or may not be for the second. You have to decide.

Learning about T

Say you are actually interested in T. In the absence of a gold-standard measurement (since you sometimes doubt M), it's hard to know how you might non-arbitrarily combine S and M to learn about T. If, on the other hand, you were willing to treat M as correct whenever it is available, then you could use S to predict M in a classification model that also contains other information from the students, and then use M rather than S in the final analysis. The concern here would be selection bias in the cases you trained on, which leads to the third issue below.
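Before that, for concreteness, here is a minimal sketch of the classification route (not a recommendation of any particular classifier; the file and column names, including the stand-in covariate family_income, are hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("survey.csv")  # hypothetical file and column names
features = ["student_edu", "family_income"]

# Train where M is observed, treating M as correct; predict where M is missing
train = df.dropna(subset=features + ["mother_edu"])
target = df[df["mother_edu"].isna()].dropna(subset=features)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(train[features], train["mother_edu"].astype(int))

# Use M where available, the model's prediction elsewhere
df.loc[target.index, "mother_edu_hat"] = clf.predict(target[features])
df["edu_final"] = df["mother_edu"].fillna(df["mother_edu_hat"])
```

The selection-bias worry applies to `train`: if the cross-checked subsample was not drawn at random, the classifier learns from a non-representative slice.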

Missingness

Whether multiple imputation can work depends on whether the data are missing completely at random (MCAR) or missing at random (MAR). Is S missing at random? Perhaps not, since students might be ashamed of their mother's lack of education and skip the question. In that case the unobserved value itself determines whether it is missing, which is missing not at random (MNAR), and multiple imputation cannot help. On the other hand, if low education covaries with something that is asked and at least partly answered in the survey, e.g. some indicator of income, then MAR may be more reasonable and multiple imputation has something to get a grip on. Is M missing at random? The same considerations apply.
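MAR can never be verified from the observed data alone, but you can at least check whether observed covariates predict missingness, which is what gives imputation something to get a grip on. A sketch with statsmodels, using the same hypothetical column names as above:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("survey.csv")  # hypothetical file and column names
df["s_missing"] = df["student_edu"].isna().astype(int)

# If observed covariates predict missingness, MCAR is untenable and MAR is the
# working assumption; note that nothing observable can rule out MNAR.
fit = smf.logit("s_missing ~ family_income", data=df).fit()
print(fit.summary())
```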

Finally, even if you are interested in T and take the classification approach, you'd still want to impute in order to fit that model.
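For the mechanics of multiple imputation itself, here is a sketch using statsmodels' MICE (the comments also mention Stata's mi command). The outcome test_score is a hypothetical stand-in for whatever the mixed model predicts; MICE here refits a plain OLS model on each completed dataset and pools by Rubin's rules, so for the actual mixed model you would fit it on each imputed dataset and pool the estimates yourself.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

df = pd.read_csv("survey.csv")  # hypothetical file and column names
cols = df[["test_score", "mother_edu", "student_edu", "family_income"]]

# Chained-equations imputation over all columns, including both S and M,
# even though only mother_edu enters the analysis model below.
imp = mice.MICEData(cols)
analysis = mice.MICE("test_score ~ mother_edu + family_income", sm.OLS, imp)
fit = analysis.fit(n_burnin=10, n_imputations=20)
print(fit.summary())
```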

– conjugateprior

If you're going to assume that the "contradiction rate" is the same for the whole sample as for the subsample whose mothers were interviewed, then the subsample must have been drawn at random. Your description doesn't say, and I raise the issue because I think it has important implications for how, or whether, you can use the information from the subsample to draw conclusions about the entire sample of students.

It seems to me that there are three facets to this contradiction issue.

1. The rate of contradiction: is it really the case that 3/4 of students guessed wrong?

2. The degree of wrongness: it's one thing to say your mother never finished elementary school when she in fact completed it but stopped there, and quite another to say she never finished elementary school when she has a Ph.D.

3. The proportion of the sample you can cross-check: if you're drawing these conclusions from a subsample of 20, then I'd bet the estimates are fairly unstable and probably not worth much.

It seems to me that what you do will depend on your answers to these questions and to the question I raised initially. For example, if 1 and 3 are both quite high, I might just use the subsample and be done with it. If 1 is high but 2 is low, the issue doesn't seem to be that bad and, again, it might not be worth bothering with.

It's probably also worth knowing whether the error is random or systematic. If students tend to systematically underestimate their mothers' education, that's more problematic than if they just get it totally wrong sometimes.
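A quick way to separate the two (hypothetical file and column names as before) is to compare the mean signed difference, which picks up systematic bias, with the mean absolute difference, which picks up error of either kind:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("survey.csv")  # hypothetical file and column names
both = df.dropna(subset=["student_edu", "mother_edu"])
diff = both["student_edu"] - both["mother_edu"]

# Systematic error shows up as a mean difference far from zero; purely random
# error leaves the mean near zero even when |diff| is often large.
print("mean signed difference:  ", diff.mean())
print("mean absolute difference:", diff.abs().mean())
print(stats.ttest_1samp(diff, 0.0))
```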

I've done some imputation on a couple of papers, and it seems like I always create more trouble for myself as a result. Reviewers, in my area at least, often don't have a good handle on the method and are thus suspicious of its use. I feel like sometimes it's better, from a publication standpoint, to just acknowledge the problem and move on. But in this case you're not really 'imputing missing data'; you're introducing some sort of predicted error variance for the variable. It is a very interesting question and, putting all the concerns aside, I'm not even sure how I would go about this if I decided it was the best course of action.

– Will
  • Thanks Will, I clarified some things in my original post. The sub-sample is random. I pulled the 3/4 stat out of a hat to make a point; the true stat is less. I can cross-check about 10,000 cases. I'm sure the error is not purely random. – Michael Bishop Mar 22 '12 at 02:27