
I am currently working on a big dataset. A few of its columns contain ordinal categorical data. To simplify the dataset, I decided to convert them to numeric. However, these columns contain missing values. I would like to drop the rows with missing values in these columns, since they make up only 0.9% of the total number of rows. I have checked the target variable I want to predict, and these 0.9% of rows don't contain boundary values. But I can't find any reference to support my approach.

Is it safe to drop these rows? Data imputation could be quite complex, since many columns contain missing values, and the dataset is also quite large for running automated data imputation such as with the mice package.
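
For concreteness, this is roughly what I mean, as a minimal sketch in R (df and the column names are placeholders for my data):

    # Fraction of rows with missing values in the ordinal columns,
    # then drop those rows
    ord_cols <- c("ord_col1", "ord_col2", "ord_col3")

    incomplete <- !complete.cases(df[, ord_cols])
    mean(incomplete)  # ~0.009 in my case

    df_clean <- df[!incomplete, ]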

Could you please suggest some references I could use to support my approach?

Many thanks!

  • Some of the columns are ordinal, and some are nominal? Or are you saying that some columns are somehow both? The latter might technically be allowed in the sense of a set being invariant to actions of two separate group structures, but often it isn't a point of emphasis. – Galen Jul 25 '22 at 20:49
  • Hi, welcome to the site! Unfortunately, it is not in general possible for us to tell you if this is OK or not. The question is: why are these rows different? If they're different because the person working that day happened to code them in wrong, then it's generally OK to ignore. If they're coded up differently because those are the most important days, it would be a bad idea to ignore them. – John Madden Jul 25 '22 at 21:08
  • Many thanks! That's helpful! – Charlotte Jul 26 '22 at 18:27

1 Answer


A good source of references is Section 1.3.1 of Stef van Buuren's Flexible Imputation of Missing Data, which covers exactly this kind of "listwise deletion."

As @John Madden notes in a comment, it depends on why/how the data are missing. If the data are "Missing Completely at Random" (MCAR) in the technical sense ("the probability of being missing is the same for all cases"), then there would only be a small loss of precision and your estimates wouldn't be biased.

Otherwise, the pros and cons of listwise deletion depend on the details; "the consequences of using listwise deletion depend on more than the missing data rate alone." The van Buuren book has references to different opinions on the matter.

If there is a pattern to the missingness, you might consider taking the 99.1% of your observations that are complete, reproducing that pattern and level of missingness among them, and seeing how much it matters. That's related to the sensitivity analysis that van Buuren recommends.
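
As a rough sketch of that idea in R (the data frame df, the outcome y, and the predictors x1 and x2 are placeholders, not from the question; the missingness here is imposed completely at random, so to mimic a suspected pattern you would make the row selection depend on other columns):

    # Sensitivity check: take the complete rows, impose the observed
    # rate of missingness on them, delete listwise, and compare fits.
    complete_rows <- df[complete.cases(df), ]

    set.seed(1)
    n <- nrow(complete_rows)
    drop_idx <- sample(n, size = round(0.009 * n))

    fit_full <- lm(y ~ x1 + x2, data = complete_rows)
    fit_drop <- lm(y ~ x1 + x2, data = complete_rows[-drop_idx, ])

    # If the estimates barely move, deletion at this rate matters
    # little under the simulated missingness mechanism.
    coef(fit_full)
    coef(fit_drop)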

I'm more worried about treating your ordinal predictors as numeric. That makes a pretty strong assumption of linearity and equal spacing of levels for those predictors. Name them numerically if you wish, but consider keeping them as ordered factors; see this page and others on this site for approaches that can work better.
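
For example, in R an ordinal predictor can stay an ordered factor rather than being coerced to numeric (the column name and levels here are illustrative):

    # Keep the ordinal coding as an ordered factor instead of
    # assuming equally spaced numeric levels.
    df$severity <- factor(df$severity,
                          levels  = c("low", "medium", "high"),
                          ordered = TRUE)

Modeling functions like lm() and glm() then use polynomial contrasts for ordered factors by default, rather than forcing a single linear slope across the levels.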

EdM