
I have a dataset on immigration applications and visa acceptances (granting of visas). Rates are calculated for "accepted" and "rejected" visa applications.

However, the dataset also has values for cases that were closed. Normally this is when the immigrant either stopped showing up to appointments, migrated elsewhere, or died. Because these numbers are not used when the rates are calculated, the rates often show up as missing (because the cases were neither accepted nor rejected).
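To make the mechanism concrete, here is a minimal sketch in Python. The counts, the dyad-year keys, and the function name `acceptance_rate` are hypothetical illustrations, not taken from the actual dataset, and the rate formula (accepted divided by accepted plus rejected) is an assumption about the standard procedure.

```python
from typing import Optional


def acceptance_rate(accepted: int, rejected: int) -> Optional[float]:
    """Share of decided cases that were accepted; None if nothing was decided."""
    decided = accepted + rejected
    if decided == 0:
        return None  # 0/0: the rate is undefined, which is different from zero
    return accepted / decided


# Hypothetical dyad-years: (accepted, rejected, otherwise_closed)
rows = {
    ("A", "B", 2001): (30, 10, 5),
    ("A", "B", 2002): (0, 0, 12),  # only "otherwise closed" cases this year
}

# Closed cases never enter the calculation, so 2002 ends up with no rate.
rates = {key: acceptance_rate(acc, rej) for key, (acc, rej, _closed) in rows.items()}
```

In this sketch the 2002 row has no rate at all, which is the situation that causes whole years to drop out.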

That being said, if the only cases for that year were "otherwise closed," will it ever be okay to drop these observations? Part of the problem that I'm having is that random years in the dataset will be dropped, because the only decisions for that year were closed.

The otherwise closed cases are very arbitrary, and as I mentioned, are most probably cases where the immigrant migrated somewhere else, and probably just used the first country as a temporary place of transit. The data does not specifically say why the immigrants left, why they were closed, etc. I'm not really sure how to deal with these missing values. I do not believe that standard imputation methods would work here, due to the rate calculations (but I could be wrong).

EJ16
  • You would not just drop them. You could apply multiple imputation. See the articles and books coauthored by Donald Rubin and Rod Little. – Michael R. Chernick Mar 25 '17 at 19:10
  • Does multiple imputation assume the data are missing at random? Are these data missing at random? MI always throws me off a bit, and this is one reason. – EJ16 Mar 25 '17 at 19:28
  • You raise a good point. They classify missing data as 1) missing completely at random, 2) missing at random, and 3) missing not at random. These categories are explained in their books. If you read their work and understand your data, you should be able to apply the method properly. You have three situations: the immigrant stopped showing up, went elsewhere, or died. This seems to be non-random, but you can estimate the probability of acceptance based on what happened to them. – Michael R. Chernick Mar 25 '17 at 19:43
  • In cases where you are unsure whether your data is MCAR, MAR or MNAR, it can be useful to consider missing data plots. Here is an example of constructing such a plot using ggplot2 and the R statistics package. – Wes Mar 25 '17 at 20:39
  • Can't you include a third category, "case dropped", in your data? Maybe then different analyses will treat it differently? Just dropping seems strange. – kjetil b halvorsen Mar 25 '17 at 21:33
  • Kjetil, that's interesting. But how would I go about it? If the dropped cases are not included in the rates, why (or how) would I create a third category? The rates are based on a standard procedure that doesn't include dropped cases. – EJ16 Mar 25 '17 at 21:48
  • Hi Michael, but the data just pools all the cases together as "otherwise closed." It doesn't break them apart. So you can't really base the probability on what happened to them. There's no real way of knowing. – EJ16 Mar 25 '17 at 21:49
  • Plotting the missing values via ggplot looks interesting, but those graphs are a bit confusing. I'm not sure how to decipher whether the data are MCAR, etc., based on those plots. – EJ16 Mar 25 '17 at 22:00

2 Answers


In your case, the important distinction is not between MCAR, MAR, and MNAR, but between real missing values and mechanical missing values. Real missing values are values that exist but for some reason weren't recorded. Mechanical missing values don't exist, but the rectangular structure of a dataset forces us to put something in the cell, e.g. pregnancy status if your dataset also includes males. Imputation techniques are designed for real missing values. Your example is a case of mechanical missing values: the decision has not been made, so its value does not exist. If a substantial portion of migrants move on, then that is an important feature of the migration process, and imputing those values hides that feature.
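One way to keep that feature visible, sketched here in Python under stated assumptions, is to record the share of closed cases and a "no decision made" indicator next to the rate instead of imputing anything. The class `DyadYearSummary`, the function `summarize`, and the counts are hypothetical names chosen for illustration; this is not a prescribed procedure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DyadYearSummary:
    acceptance_rate: Optional[float]  # None when no decision exists (mechanical missing)
    closed_share: Optional[float]     # share of all cases that were otherwise closed
    any_decision: bool                # False when the rate does not apply at all


def summarize(accepted: int, rejected: int, closed: int) -> DyadYearSummary:
    """Summarize one dyad-year without imputing a rate that does not exist."""
    decided = accepted + rejected
    total = decided + closed
    return DyadYearSummary(
        acceptance_rate=accepted / decided if decided else None,
        closed_share=closed / total if total else None,
        any_decision=decided > 0,
    )


# Hypothetical dyad-year in which every case was otherwise closed.
only_closed = summarize(accepted=0, rejected=0, closed=12)
```

The row is kept, its rate stays undefined, and the fact that all cases were closed becomes a variable that can itself be analyzed.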

Maarten Buis
  • Unlike for male pregnancy, however, there could have been an accept/reject decision in the cases where people died, moved on, or stopped showing up. In survival analysis these could be treated intelligently as censored cases, provided that the censoring was uninformative. I wonder if there is some way to incorporate censored status in analysis for this case at hand. – EdM Mar 26 '17 at 15:05
  • Maarten, thank you so much. It makes sense. I also thought that perhaps it was a form of censoring (e.g. migrant death). But I don't understand what "imputing those values hides that feature" means. Does this mean, then, that multiple imputation should not be done? If so, what are the other options? I'm still scratching my head. – EJ16 Mar 26 '17 at 15:05
  • Perhaps, and those were recorded as accepted/denied. But there are years where no decisions were made at all, and the only outcomes were "otherwise closed." So that's the part I'm stuck on at the moment. – EJ16 Mar 26 '17 at 15:07
  • Maarten, never mind. I re-read the answer, and I now understand that MI would not suffice for this. It's currently showing that about half of the cases are otherwise closed and therefore missing. I guess my question is still what to do with these cases, because the standard procedure is not to include them in the rate calculations. – EJ16 Mar 26 '17 at 15:48
  • +1 Good answer. One point worth noting is that "mechanical" MVs are more commonly referred to as "structural zeros" or null values, at least in the US literature. – user78229 Mar 26 '17 at 18:10
  • DJohnson, thank you for clarifying. I have seen some research that has imputed "zeros" for the missing values and then used a zero-inflated regression. However, in this case, the zeros actually mean something, because it is a rate. So I'm not sure what to do with these missing values, since I shouldn't convert them to zeros. – EJ16 Mar 26 '17 at 19:42
  • I think it depends on the unit of analysis. Is your data at the level of the individual case? If so, then per kjetil b halvorsen's comment, creating a third outcome, "dropped" or "unknown", is preferable to deleting the observations. – user78229 Mar 26 '17 at 22:22
  • Hi, it's actually dyad year. country a-country b year. – EJ16 Mar 26 '17 at 22:58
  • Paired observations? Why not break them up into a separate record for each country-year, creating a more easily analyzed panel data model structure? – user78229 Mar 28 '17 at 13:39
  • The research has to look at pairs, because that's part of the hypothesis. So, it can't be country year. – EJ16 Mar 28 '17 at 15:21
  • Would it make sense, in the case of structural/mechanical missing values, e.g. in your pregnancy-male example, to add an indicator like "pregnancy-does-not-apply"? That is, some indicator as to whether this attribute should be "available" at all to a particular case? – boot-scootin Aug 27 '20 at 12:59

It is clearly a mix of at least two different missingness processes.

  1. People who die from causes unrelated to the procedure, abandon it, etc., for reasons other than the likely outcome of the procedure. Here some imputation under MAR makes sense (if you can clearly identify the cases).
  2. People who give up, withdraw, or drop out because they do not fulfil some rules and/or think they are unlikely to be successful, or that it is too much hassle. Here it depends on whether you can assess, from the data you have, what their chances would have been had they continued. If you can, a MAR assumption is fine; otherwise you have a difficult MNAR situation.

What to do about MNAR is difficult. Assuming such cases had no success may be a bit extreme (or very appropriate; after all, they did not succeed). Alternatively, impute under MAR, then make these cases progressively less successful until you hit 0%, and contemplate the resulting range of values.
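A rough sketch of that range in Python follows. It is a deterministic expected-value simplification, not a proper multiple imputation: closed cases are assumed to have been accepted at a rate that is scaled down from a MAR value to 0%. The function name `sensitivity_rates`, the `mar_rate` parameter (which would have to come from somewhere, e.g. the observed rate among decided cases), and the counts are all hypothetical.

```python
from typing import List, Tuple


def sensitivity_rates(
    accepted: int, rejected: int, closed: int, mar_rate: float, steps: int = 5
) -> List[Tuple[float, float]]:
    """Return (assumed acceptance rate for closed cases, overall rate) pairs.

    The assumed rate starts at mar_rate (the MAR scenario) and is scaled
    down in equal steps to 0.0 (no closed case would have been accepted).
    """
    total = accepted + rejected + closed
    if total == 0:
        raise ValueError("no cases at all for this unit")
    results = []
    for i in range(steps + 1):
        delta = 1 - i / steps          # 1.0 = MAR scenario, 0.0 = 0% success
        assumed = delta * mar_rate     # assumed success rate of closed cases
        overall = (accepted + assumed * closed) / total
        results.append((assumed, overall))
    return results


# Hypothetical counts: 30 accepted, 10 rejected, 10 otherwise closed.
scenarios = sensitivity_rates(30, 10, 10, mar_rate=0.75, steps=3)
```

Note that for a unit with only closed cases there is no observed rate to start from, so `mar_rate` would need to be borrowed from elsewhere (an assumption in its own right), and the range then runs from that value down to zero.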

Björn
  • Indeed. In the beginning, I believed the data were MNAR. However, I think that Maarten is right. It has just left me a little more confused as to what to do with the structural zeros. – EJ16 Mar 26 '17 at 19:41