I have a question about selection bias, specifically when only a subset of a larger dataset can be used.
Suppose you have a database with data quality issues that prevent you from using all of it for inference. In particular, you must 'fuzzy match' names in the database, and you're only able to match 4,000 of the 3,000,000 records.
If you fit a model (of whatever type/class) to these n = 4,000 matched records and use it to make inferences about the population of 3,000,000, isn't it possible that you're introducing selection bias?
For example, suppose the reason those 4,000 observations matched successfully is that those individuals were more meticulous in typing their names. Aren't you effectively cutting out the part of the population that wasn't so careful in typing their names? Does this really matter?
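To make the concern concrete, here is a minimal simulation sketch of the scenario I have in mind. Everything in it is an assumption for illustration: the latent `care` trait, the 0.5 effect of `care` on the outcome, the logistic match probability, and the function name `simulate_matching` are all hypothetical, and the population is scaled down from 3,000,000 so that it runs quickly.

```python
import math
import random
import statistics


def simulate_matching(n_pop=300_000, seed=0):
    """Return (population mean, matched-sample mean, number matched)."""
    rng = random.Random(seed)
    population = []
    matched = []
    for _ in range(n_pop):
        # Hypothetical latent "carefulness" trait (an assumption, never observed).
        care = rng.gauss(0.0, 1.0)
        # Assumed outcome that partly depends on carefulness.
        outcome = 0.5 * care + rng.gauss(0.0, 1.0)
        population.append(outcome)
        # Assumed match probability: rare overall, rising with carefulness.
        p_match = 1.0 / (1.0 + math.exp(-(-7.0 + 1.5 * care)))
        if rng.random() < p_match:
            matched.append(outcome)
    return statistics.fmean(population), statistics.fmean(matched), len(matched)


if __name__ == "__main__":
    pop_mean, matched_mean, n_matched = simulate_matching()
    print(f"matched records: {n_matched} of 300000")
    print(f"population mean outcome: {pop_mean:.3f}")
    print(f"matched-sample mean outcome: {matched_mean:.3f}")
```

Under these assumed settings the matched-sample mean should sit well above the population mean, because matching and the outcome both depend on `care`. Setting the 0.5 coefficient to zero removes that dependence, and the gap should shrink to sampling noise, which is why I suspect the answer depends on the analysis.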
If selection bias is possible here and could affect the results (depending on the analysis undertaken), are there general ways to address or correct for it?