I have a large survey in which students were asked, among other things, about their mother's level of education. Some skipped the question, and some answered wrongly. I know this because the mothers of a sub-sample of the initial respondents were later interviewed and asked the same question. (I'm sure there is some smaller amount of error associated with the mothers' responses as well.)
My challenge is to decide how best to take advantage of this second, more reliable source of data. At the very least, I can use it to impute missing data more intelligently than I could if I had to rely on complete cases alone. But if 3/4 of the children whose answers I can cross-check and who respond "My mother never finished elementary school" are contradicted by their mother's answer, then it would seem I should use imputation to create multiple datasets that capture that uncertainty. [added: I said 3/4 to make a point, but now that I've checked the data I might as well tell you that closer to 40% are discrepant]
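To make the idea concrete, here is a minimal sketch of what I have in mind, not a settled method. It assumes both answers are coded as integer levels with NaN for missing, and every name in it (`discrepancy_by_student_answer`, `impute_mother_answer`, the toy arrays) is made up for illustration. It first tabulates how often the mother's answer contradicts each student answer in the cross-checked cases, then fills each missing mother's answer by drawing from the mothers' answers observed among cross-checked students who gave the same response. It ignores the nesting in schools and any other covariates, so it is a simple hot-deck style draw rather than a full multiple-imputation model.

```python
import numpy as np

def discrepancy_by_student_answer(student, mother):
    """Per student answer, share of cross-checked cases where the mother's answer differs."""
    student = np.asarray(student, dtype=float)
    mother = np.asarray(mother, dtype=float)
    both = ~np.isnan(student) & ~np.isnan(mother)  # cases that can be cross-checked
    rates = {}
    for level in np.unique(student[both]):
        mask = both & (student == level)
        rates[int(level)] = float(np.mean(mother[mask] != level))
    return rates

def impute_mother_answer(student, mother, n_datasets=5, seed=0):
    """Return n_datasets copies of `mother` with missing values filled by random draws.

    Each missing value is drawn from the mothers' answers observed among
    cross-checked students who gave the same answer (assumption: no other
    predictors, no school effects).
    """
    rng = np.random.default_rng(seed)
    student = np.asarray(student, dtype=float)
    mother = np.asarray(mother, dtype=float)
    observed = ~np.isnan(mother)
    both = observed & ~np.isnan(student)
    datasets = []
    for _ in range(n_datasets):
        filled = mother.copy()
        for i in np.flatnonzero(~observed):
            donors = mother[both & (student == student[i])]
            if donors.size == 0:
                # Student skipped the question or gave an answer never cross-checked:
                # fall back to all observed mothers' answers.
                donors = mother[observed]
            filled[i] = rng.choice(donors)
        datasets.append(filled)
    return datasets

# Toy data (invented): 1 = never finished elementary school, higher = more schooling.
student = [1, 1, 1, 1, 5, 5, np.nan, 1, 5]
mother = [1, 2, 1, 3, 5, 5, 4, np.nan, np.nan]

rates = discrepancy_by_student_answer(student, mother)
datasets = impute_mother_answer(student, mother, n_datasets=5, seed=0)
print(rates)
print(datasets[0])
```

Each completed dataset would then be analysed separately and the estimates pooled, which is where the between-dataset variation would reflect the uncertainty in the students' answers.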
I will personally be using mother's education as a predictor in a mixed model, but if anyone has something to say about other situations I'd love to learn about them as well.
I would love to receive advice in broad strokes or in the specifics. Thank you!
Update: I'm leaving the question unsolved for now. Though I appreciate Will's and Conjugate_Prior's responses, I'm holding out hope for more specific and technical feedback.
The scatterplot below will give you an idea of how the two variables are related in the 10,000 cases where both exist. The students are nested in more than 100 schools. The two answers correlate at 0.78. Student's answer: mean = 5.12, s.d. = 2.05. Mother's answer: mean = 5.02, s.d. = 1.92. The student's answer is missing in about 15% of cases.
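As a rough illustration of what these summary statistics imply, the sketch below computes the least-squares line for predicting the mother's answer from the student's answer, using only the correlation, means, and standard deviations quoted above. It assumes the education levels can be treated as roughly continuous and that the relationship is linear, neither of which I have verified, and it ignores the school nesting. The function name `calibration_line` is hypothetical.

```python
import math

def calibration_line(r, mean_x, sd_x, mean_y, sd_y):
    """Slope, intercept, and residual s.d. of the least-squares line of y on x,
    derived from summary statistics alone."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    residual_sd = sd_y * math.sqrt(1.0 - r ** 2)
    return slope, intercept, residual_sd

# x = student's answer, y = mother's answer (figures quoted in the question).
slope, intercept, residual_sd = calibration_line(
    r=0.78, mean_x=5.12, sd_x=2.05, mean_y=5.02, sd_y=1.92
)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, residual s.d.={residual_sd:.2f}")
```

A slope below 1 means that extreme student answers are pulled toward the middle when predicting the mother's answer, and the residual s.d. gives a sense of how much spread an imputation draw would need to carry.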
