Selection Bias - Taking a subset of a larger database

Question

I have a question on selection bias, mainly in terms of sampling from a larger dataset.

Suppose you have a database, and there are some data quality issues that prevent you from using the whole database for inference. In particular, you must 'fuzzy match' names from a database, and you're only able to match 4,000 out of 3,000,000 in the dataset.

If you proceed with using this data with n=4,000 in a model, of whatever type/class, to make an inference on the population of 3,000,000, wouldn't it be possible that you're introducing selection bias?

That is, suppose the reason you successfully matched those 4,000 observations is that those individuals were more meticulous in typing in their names. Aren't you effectively cutting out the part of the population that wasn't so careful in typing in their names? Does this really matter?

If selection bias is possible here and could have an effect (depending on the analysis undertaken), are there general ways to address/remove this problem?

Do you only have access to those observations in the larger set for which you can get a fuzzy match, or are you able to randomly select records from the larger dataset? Are you able to get the distributions of features across the large dataset? — Tchotchke, Apr 26 '16 at 16:24
So the dataset consists of two tables that I'm joining by matching. Due to quality issues the match rate is poor. Both tables contain information that I need to use in the analysis. So in a sense, I am restricted to just the observations that I can fuzzy match. I could get the distribution of 'features' or the variables for the entire population on each table, though, and see how those are distributed in the population — user113574, Apr 26 '16 at 17:50
Could you regard each observation which you can match as representative of all the other observations which (a) have similar characteristics and (b) you cannot match, and then weight each one accordingly? Perhaps with the inverse of the probability of being matched? — mdewey, Apr 27 '16 at 15:30
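The weighting idea in the comment above could be sketched roughly as follows. This is only an illustration under a strong assumption: that the chance of being matched depends solely on an observed grouping variable (the hypothetical strata "young" and "old" here), so the match probability in each stratum can be estimated as matched count divided by population count. The function names and the toy numbers are made up for the example.

```python
from collections import Counter

def inverse_probability_weights(population_strata, matched_strata):
    """Weight per stratum = 1 / estimated match probability in that stratum."""
    pop_counts = Counter(population_strata)
    matched_counts = Counter(matched_strata)
    # Estimated match probability is matched / population, so its inverse
    # is population / matched. Strata with no matched records get no weight
    # at all, which this simple approach cannot repair.
    return {s: pop_counts[s] / matched_counts[s] for s in matched_counts}

def weighted_mean(values, strata, weights):
    """Mean of values, each weighted by the weight of its stratum."""
    total_weight = sum(weights[s] for s in strata)
    return sum(weights[s] * v for v, s in zip(values, strata)) / total_weight

# Toy data (invented): the population is half "young" and half "old",
# but "old" records were matched four times as often.
population_strata = ["young"] * 1000 + ["old"] * 1000
matched_strata = ["young"] * 10 + ["old"] * 40
matched_outcome = [1.0] * 10 + [3.0] * 40

weights = inverse_probability_weights(population_strata, matched_strata)
naive_mean = sum(matched_outcome) / len(matched_outcome)
adjusted_mean = weighted_mean(matched_outcome, matched_strata, weights)
```

In this toy case the unweighted mean is pulled toward the over-matched "old" group, while the weighted mean recovers the value you would get if both groups were represented in their population proportions. If matching also depends on something you cannot observe, weighting like this will not remove the bias.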

score 0 · Answer 1 · answered Apr 27 '16 at 14:21

Worrying about selection bias seems justified in this case, as you're ending up with a non-random sample of roughly 0.13% of your population (4,000 of 3,000,000). You could look at the distributions of the features in your subset and compare them to those in the overall population. If those distributions are quite different for many of the features, or for the features you consider important, then selection bias would seem to be a problem.
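One simple way to make that comparison concrete is a standardized mean difference per numeric feature. The sketch below uses synthetic data and a hypothetical "age" feature purely for illustration; the function name and the selection probabilities are assumptions, not anything from the actual dataset.

```python
import random
import statistics

def standardized_mean_difference(subset, population):
    """Difference in means, scaled by the population standard deviation."""
    pop_sd = statistics.pstdev(population)
    if pop_sd == 0:
        return 0.0
    return (statistics.mean(subset) - statistics.mean(population)) / pop_sd

rng = random.Random(0)

# Synthetic population for a hypothetical numeric feature.
population_age = [rng.gauss(40, 12) for _ in range(30000)]

# A truly random subset should look like the population.
random_subset = rng.sample(population_age, 400)

# A biased subset: records with age > 50 are six times as likely to be "matched".
biased_subset = [
    a for a in population_age
    if rng.random() < (0.03 if a > 50 else 0.005)
]

smd_random = standardized_mean_difference(random_subset, population_age)
smd_biased = standardized_mean_difference(biased_subset, population_age)
```

Here `smd_random` should sit close to zero while `smd_biased` is clearly positive. A large value on a feature you care about is a warning sign; a small value on every observed feature is reassuring but does not rule out selection on something unobserved.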

Without knowing more about your dataset, here are some additional thoughts. If you expected to be able to match significantly more of the 3 million observations, I'd spend more time investigating why you aren't getting matches (e.g., is capitalization different between the fields, are there trailing blanks, etc.). If that's not the case, you could still look for matches on more than just name: perhaps match on first name and date of birth, then consider it a match if the edit distance between the last names is < X.
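A rough sketch of both suggestions, normalizing names before comparison and then matching on first name plus date of birth with a last-name edit-distance threshold, might look like this. The record layout (`first_name`, `last_name`, `dob` keys), the function names, and the default threshold of 3 are all hypothetical; the right value of X depends on your data.

```python
def normalize(name):
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(name.strip().lower().split())

def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def is_match(rec_a, rec_b, x=3):
    """Exact match on first name and date of birth, last-name distance < x."""
    if normalize(rec_a["first_name"]) != normalize(rec_b["first_name"]):
        return False
    if rec_a["dob"] != rec_b["dob"]:
        return False
    last_a = normalize(rec_a["last_name"])
    last_b = normalize(rec_b["last_name"])
    return edit_distance(last_a, last_b) < x

# Invented example records with a casing difference, stray blanks, and a typo.
a = {"first_name": " Ann ", "last_name": "Johnson ", "dob": "1980-01-02"}
b = {"first_name": "ann", "last_name": "JOHNSTON", "dob": "1980-01-02"}
c = {"first_name": "ann", "last_name": "JOHNSTON", "dob": "1975-06-30"}
```

With these records, `is_match(a, b)` is true and `is_match(a, c)` is false because the dates of birth differ. Comparing every pair across 3 million rows is expensive, so in practice you would first group candidates on the exact-match fields and only compute edit distances within each group.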
