Questions tagged [missing-data]

Use when the data contain gaps, i.e., are incomplete. It is important to account for this when performing an analysis or test.

In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

Tag wiki reference: Wikipedia

1616 questions
36 votes, 6 answers

Why do some people use -999 or -9999 to replace missing values?

I have a dataset with lots of missing values. In some columns the missing values were replaced with -999, but in other columns they were marked as 'NA'. Why would we use -999 to replace a missing value?
qqqwww
  • 503
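
A minimal sketch of why sentinel codes such as -999 need care, assuming a small made-up CSV (the column names `a` and `b`, the values, and the `MISSING_MARKERS` set are hypothetical, not taken from the question). It converts the markers to NaN before any arithmetic and compares the result with a mean computed while the sentinel is still in place.

```python
import csv
import io
import math

# Hypothetical set of markers; the question itself only mentions -999 and 'NA'.
MISSING_MARKERS = {"-999", "-9999", "NA", ""}


def parse_cell(text):
    """Return a float, or NaN if the cell holds a missing-value marker."""
    text = text.strip()
    if text in MISSING_MARKERS:
        return math.nan
    return float(text)


# Made-up data: column "a" codes missing as -999, column "b" as NA.
raw_text = "a,b\n1.5,2.0\n-999,NA\n3.0,4.5\n"

rows = [
    {name: parse_cell(value) for name, value in row.items()}
    for row in csv.DictReader(io.StringIO(raw_text))
]

# Count missing cells per column after recoding.
missing_per_column = {
    name: sum(math.isnan(row[name]) for row in rows) for name in rows[0]
}

# Mean of column "a" if the sentinel is mistaken for a real measurement.
naive_values = [float(r["a"]) for r in csv.DictReader(io.StringIO(raw_text))]
naive_mean = sum(naive_values) / len(naive_values)

# Mean of column "a" using only the observed values.
observed_a = [row["a"] for row in rows if not math.isnan(row["a"])]
clean_mean = sum(observed_a) / len(observed_a)

print(missing_per_column, naive_mean, clean_mean)
```

Left unrecoded, the sentinel is treated as a real measurement and drags the mean far below every observed value.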
18 votes, 2 answers

80% of missing data in a single variable

One variable in my data has 80% missing data. The data are missing because the quantity does not exist (i.e. how much bank loan the company owes). I came across an article saying that the dummy variable adjustment method is the solution for this…
lcl23
  • 235
14 votes, 3 answers

Distinguishing missing at random (MAR) from missing completely at random (MCAR)

I've had these two explained multiple times. They continue to cook my brain. Missing Not at Random makes sense to me, and Missing Completely at Random makes sense...it's the Missing at Random that doesn't as much. What gives rise to data that would…
Fomite
  • 23,134
13 votes, 3 answers

Techniques for Handling Incomplete/Missing Data

My question concerns techniques for dealing with incomplete data during classifier/model training/fitting. For instance, in a dataset with a few hundred rows, each row having, let's say, five dimensions and a class label as the last item, most…
doug
  • 10,549
13 votes, 3 answers

When is it a good idea to just use the average for imputation?

Suppose we have a data set test: 1 8 12 14 . . 19, where . denotes a missing value. When would it be better to use the average of the non-missing values to impute the missing values rather than assuming that the data come from a normal distribution?
thoms
  • 151
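
As a rough illustration of what mean imputation does to the data in the question above, the sketch below fills the two missing entries with the average of the observed values. Using `None` for the `.` marker and the variable names are assumptions made for this example. It also compares the sample variance before and after, since mean imputation keeps the mean fixed but shrinks the spread.

```python
import statistics

# Data from the question; None stands in for the "." missing marker.
test = [1, 8, 12, 14, None, None, 19]

# Average of the non-missing values.
observed = [x for x in test if x is not None]
fill_value = statistics.mean(observed)

# Replace every missing entry with that average.
imputed = [fill_value if x is None else x for x in test]

# Sample variance (n - 1 denominator) before and after imputation.
var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print(fill_value, imputed)
print(var_observed, var_imputed)
```

The imputed values add nothing to the sum of squared deviations but increase the sample size, so the variance computed from the filled-in data is smaller than the variance of the observed values.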
12 votes, 2 answers

How to handle non-existent (not missing) data?

I've never really found any good text or examples on how to handle 'non-existent' data for inputs to any sort of classifier. I've read a lot on missing data but what can be done about data that cannot or doesn't exist in relation to multivariate…
user3484
8 votes, 2 answers

Is listwise deletion / complete case analysis biased if data are not missing completely at random?

In the comments to the answer to my question I stated "Many rows have only 1 missing variable, so to exclude the row I think leads to bias (they are not MCAR)" and in reply I was told "You're wrong, see Rubin's Statistical Analysis with Missing Data…
Joe King
  • 3,805
8 votes, 2 answers

Is it ever okay to drop missing observations?

I have a dataset that looks at immigration applications and visa acceptances (granting of visas). The rates are calculated for "accepted" and "rejected" visa applications. However, the dataset also has values for cases that were closed.…
EJ16
  • 145
8 votes, 4 answers

Is the method of mean substitution for replacing missing data out of date?

Is the method of mean substitution for replacing missing data out of date? Are there more sophisticated models that should be used? If so, what are they?
7 votes, 2 answers

MAR vs. MNAR: how can I decide?

I'm working with a big dataset (400,000 participants) and it has missing values in 4 variables: two are continuous, with 3% and 10% missing, and the other two are categorical, each with less than 5% missing. I…
4 votes, 1 answer

Pooling the results of random hot-deck imputation

I am using random hot-deck imputation on a repeated measures dataset. I am tempted to use Rubin's rules for pooling the results of multiple imputation, in particular for regression coefficients. Intuitively it seems the average of the coefficient…
Robert Long
  • 60,630
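
For reference, Rubin's rules for pooling a single coefficient across m imputed datasets can be sketched as below. The coefficient estimates and squared standard errors are made-up numbers, and the function name `pool_rubin` is hypothetical. The sketch only shows the arithmetic of the rules; whether they are valid for random hot-deck imputation is the open question being asked and is not settled here.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their squared standard errors.

    Returns (pooled estimate, total variance) following Rubin's rules:
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)       # average of the m estimates
    within = statistics.mean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from m = 3 imputed datasets.
coef_estimates = [1.0, 1.2, 0.8]
coef_variances = [0.04, 0.05, 0.03]

pooled_coef, total_var = pool_rubin(coef_estimates, coef_variances)
pooled_se = math.sqrt(total_var)

print(pooled_coef, total_var, pooled_se)
```

The pooled standard error is larger than the one obtained by averaging the per-imputation variances alone, because the between-imputation term carries the extra uncertainty due to the missing values.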
3 votes, 1 answer

ISC exam - cheating or not

Background: I read this article on "hackaday" about alleged "large-scale cheating" on the ISC exam. It gives this as source. Here is one of the images from the site: The hack-a-day asks for speculation about the nature of the "cheating" that the…
EngrStudent
  • 9,375
3 votes, 0 answers

What is the theoretical ideal when dealing with multiple causes of missing not at random (MNAR) data - TL;DR included

Background to problem: I am currently in the process of computing some quantitative data (questionnaire Likert scales) and there are clear differences in missing data: a specific item has ~400 missing responses, compared to ~100 on each of the other 9 items.…
2 votes, 1 answer

How to cope with missing values in sequential data before applying moving averages (and in general)?

I have a set of datasets with sequential measurements. Since these sets are quite big (>80000 measurements), I decided to simplify them by applying a Simple Moving Average (SMA) and selecting the data every n measurements. Each set belongs to…
Bakaburg
  • 2,917
2 votes, 1 answer

Listwise deletion appropriate?

I have a data set with 440 responses. Eleven people did not answer any question on the survey, and there are a couple of missing values here and there among the remaining responses. Is listwise deletion my best option? In all, I…
Cindy
  • 31