I am working on insurance data in which each customer has a field named customer_no_dependent (the customer's number of dependents). It comes out as a significant variable (it has $p<0.0001$).
This variable has almost 20% missing values. For imputation, I thought of finding proxy indicators for the number of dependents. I tried age, on the assumption that an older person may have more dependents. I also looked at its correlation with premium amount, reasoning that a person with more dependents may have less disposable income, so a low premium could indicate more dependents. I do understand that a demographic variable cannot be fully recovered from this kind of logic.
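The proxy idea above can be sketched roughly as follows. This is only a minimal illustration on made-up toy data: the keys `age` and `premium`, the helper names, and the single-predictor linear fit are all assumptions for the example, not the actual data or model. It checks how strongly each proxy correlates with the observed values, fills the gaps from one proxy, and keeps a flag recording which rows were imputed.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def impute_from_proxy(rows, target, proxy):
    """Fill missing `target` values from a line fitted on the observed rows.

    Returns new row dicts with an extra '<target>_was_missing' flag so the
    imputed rows stay identifiable in later modelling.
    """
    observed = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line([r[proxy] for r in observed],
                                [r[target] for r in observed])
    filled = []
    for r in rows:
        new = dict(r)
        new[target + "_was_missing"] = r[target] is None
        if r[target] is None:
            # Counts cannot be negative, so clip at zero after rounding.
            new[target] = max(0, round(intercept + slope * r[proxy]))
        filled.append(new)
    return filled

# Toy data (entirely synthetic; the relationships are invented for the demo).
random.seed(0)
toy = []
for _ in range(1000):
    age = random.randint(20, 70)
    dependents = max(0, round((age - 20) / 15 + random.gauss(0, 1)))
    premium = 5000 - 400 * dependents + random.gauss(0, 500)
    if random.random() < 0.2:  # roughly 20% missing, as in the question
        dependents = None
    toy.append({"age": age, "premium": premium,
                "customer_no_dependent": dependents})

seen = [r for r in toy if r["customer_no_dependent"] is not None]
for proxy in ("age", "premium"):
    r_value = pearson([r[proxy] for r in seen],
                      [r["customer_no_dependent"] for r in seen])
    print(proxy, "correlation with observed dependents:", round(r_value, 3))

imputed = impute_from_proxy(toy, "customer_no_dependent", "age")
print("rows imputed:", sum(r["customer_no_dependent_was_missing"] for r in imputed))
```

A weak correlation on the observed rows would suggest the proxy carries little information for the missing ones, and the fit can only be trusted if the rows with missing values resemble the observed rows.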
Now, if somebody looks at this in detail, they can show that my imputation is far from perfect. What should I do in such a situation? Would deleting those 20% of rows be a correct solution? In my data, 20% is close to 2 lakh (200,000) rows, which is a large amount of information.
I know this question can have many possible answers. I would be grateful for any pointers on how to proceed.