Unsupervised learning problem: rate individual survey submissions on randomness of data entered

Question

For my work we collect a lot of surveys which contain a mix of questions: multiple choice, checkbox and numeric. We know that in some instances enumerators might fudge the survey by entering answers at random. We also know that for the majority of surveys, the responses to certain questions will correlate strongly with the answers to other questions (e.g. the more advanced a respondent's schooling, the higher their income is likely to be).

Is it possible, with this knowledge, to design an approach to determine which surveys are most likely to have been filled out at random? We don't know ahead of time which questions the survey will contain (although we do know the types of questions).

Thanks.

If you can specify in some way what sort of pattern the enumerators are likely to use, and if you are prepared to assume they all use the same one, then you might make some headway. Otherwise it is going to be hard to distinguish cheating enumerators from really bizarre but genuine respondents. — mdewey, Jan 11 '18 at 13:30

score 0 · Answer 1 · answered Jan 11 '18 at 13:10

Maybe you could consider your task from two different perspectives:

Supervised approach:

If you have a good amount of historical data, I think a supervised method could be a good option. Random forests or XGBoost, for example, tend to perform well on structured data and should be fine. The way to go would be to label those surveys as "fake" or "not fake" and then train a classifier on the labelled data.
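
As a rough illustration of the supervised route, here is a minimal sketch using scikit-learn's `RandomForestClassifier`. Everything about the data is an assumption made up for the example: two ordinal questions (`schooling` and `income`, coded 0 to 4), genuine surveys in which income follows schooling, and "fake" surveys in which both answers are drawn independently at random. In practice the labels would have to come from surveys you have actually verified.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Hypothetical genuine surveys: income level stays within one step of schooling.
schooling = rng.integers(0, 5, size=n)
income = np.clip(schooling + rng.integers(-1, 2, size=n), 0, 4)
genuine = np.column_stack([schooling, income])

# Hypothetical fabricated surveys: both answers drawn independently at random.
fake = rng.integers(0, 5, size=(n, 2))

X = np.vstack([genuine, fake])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])  # 1 = "fake"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

print("held-out accuracy:", clf.score(X_test, y_test))

# Estimated probability that each held-out survey is fake, usable as a ranking.
fake_probability = clf.predict_proba(X_test)[:, 1]
```

With only two questions, many random answers happen to look plausible, so the classifier cannot catch all of them. The more correlated questions a survey contains, the easier the separation should become.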

Unsupervised approach:

If, on the contrary, you don't have that much data, or labelling it is not an option for some reason, you could try multidimensional outlier detection. The idea would be to go through all your previous samples and fit a density estimate to them. Once this is done, you can compute the likelihood of a new survey under the density estimated from all your data; a survey with an unusually low likelihood is a candidate fake. Using a threshold value (which you can roughly check empirically if you have several known fake samples) or applying multiple hypothesis testing should help you identify fakes.
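
The density idea can be sketched with a hand-rolled Gaussian kernel density estimate, using only the standard library. The data, the bandwidth of 0.05, and the 1% cut-off are all assumptions chosen for illustration: two numeric answers scaled to [0, 1], with income tracking schooling. Multiple-choice or checkbox answers would first need a numeric encoding (one-hot, for instance), which is not shown here.

```python
import math
import random

BANDWIDTH = 0.05  # assumed kernel width; would need tuning on real data


def log_density(point, sample, bandwidth=BANDWIDTH):
    """Log of a Gaussian kernel density estimate at `point`, up to an additive constant."""
    exponents = [
        -sum((p - s) ** 2 for p, s in zip(point, row)) / (2 * bandwidth ** 2)
        for row in sample
    ]
    top = max(exponents)  # log-sum-exp trick for numerical stability
    total = sum(math.exp(e - top) for e in exponents)
    return top + math.log(total) - math.log(len(sample))


rng = random.Random(0)

# Hypothetical historical surveys: income tracks schooling, both scaled to [0, 1].
history = []
for _ in range(300):
    schooling = rng.random()
    income = min(1.0, max(0.0, schooling + rng.gauss(0, 0.05)))
    history.append((schooling, income))

# Threshold: roughly the 1st percentile of leave-one-out scores on the history,
# so each historical survey is scored against all the others.
loo_scores = sorted(
    log_density(row, history[:i] + history[i + 1:])
    for i, row in enumerate(history)
)
threshold = loo_scores[len(loo_scores) // 100]

plausible = (0.80, 0.82)   # high schooling, high income
suspicious = (0.90, 0.10)  # high schooling, very low income

for survey in (plausible, suspicious):
    score = log_density(survey, history)
    print(survey, round(score, 2), "flag" if score < threshold else "ok")
```

A low score only says the survey is unusual relative to the others. It cannot tell a fabricated survey from a genuine but atypical respondent, so flagged surveys are best treated as candidates for manual review.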

Of course, it all depends on what your data looks like. I hope this helps you at least get started.

Good luck!
