Data imputation question

Question

I have a variable with some missing values

a <- rnorm(100)
a[sample(1:100, 10)] <- NA
a

How can I fill each missing value with the previous non-missing value?

For example, if I have the sequence:

a <- c(3, 2, 1, 6, 3, NA, 23, 23, NA)

the first NA should be replaced by the nearest preceding non-NA value, 3, the second NA should be replaced with 23, and so on.

Thanks

Your question lacks a host of critical details. The situation you just programmed gives no context as to what you are trying to accomplish, nor to what "previous non missing value" refers to. Is your data a time series? — Andy W, Jul 08 '11 at 18:10
@user333 As far as I can see, your "imputation approach" lacks a sound statistical basis. To get a better understanding of how imputation works, you might want to check out the following (non-technical) literature. — Bernd Weiss, Jul 09 '11 at 09:57
So it is a pure programming question? Then maybe it is a better idea to ask this on SO? Here you are more likely to be criticized for your method... — , Jul 09 '11 at 10:36
Looks like a programming question that should be moved to SO. — Roman Luštrik, Jul 09 '11 at 19:24
Well... you could argue that... but I just never liked the idea of posting R questions on SO. Don't know why. Just seems wrong to me. — user333, Jul 15 '11 at 21:05

score 2 · Accepted Answer · answered Jul 09 '11 at 17:51

To answer just the technical question:

set.seed(5)
a <- rnorm(20)
a[sample(1:20, 4)] <- NA

a is:

 [1] -0.84085548  1.38435934 -1.25549186  0.07014277          NA -0.60290798
 [7] -0.47216639 -0.63537131 -0.28577363          NA  1.22763034 -0.80177945
[13] -1.08039260 -0.15753436          NA -0.13898614          NA -2.18396676
[19]  0.24081726 -0.25935541

To set each NA to the previous value:

NAs <- which(is.na(a))
a[NAs] <- a[NAs-1]

giving

 [1] -0.84085548  1.38435934 -1.25549186  0.07014277  0.07014277 -0.60290798
 [7] -0.47216639 -0.63537131 -0.28577363 -0.28577363  1.22763034 -0.80177945
[13] -1.08039260 -0.15753436 -0.15753436 -0.13898614 -0.13898614 -2.18396676
[19]  0.24081726 -0.25935541

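Note that this fails if the first value is missing (NAs-1 then contains 0, so the replacement is one element short). It also fills only the first NA in a run of consecutive NAs, because a[NAs-1] is read before any assignment happens and is itself NA for the later ones.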
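If the data can contain runs of NAs or a leading NA, a slightly more general version may be useful. The following is only a sketch in base R, not part of the original answer; `locf` is a hypothetical helper name. It assumes that a leading NA, which has no earlier value to copy, should simply stay NA. (The zoo package's `na.locf` covers the same ground if an extra dependency is acceptable.)

```r
# Last observation carried forward (LOCF) for a plain vector.
# locf() is a hypothetical helper name chosen for this example.
locf <- function(x) {
  ok <- !is.na(x)
  # cumsum(ok) counts the non-missing values seen up to each position;
  # 0 means there is no earlier value to carry forward.
  idx <- cumsum(ok)
  # Prepend NA so that idx == 0 maps to NA instead of being dropped.
  c(NA, x[ok])[idx + 1]
}

a <- c(3, 2, 1, 6, 3, NA, 23, 23, NA)
locf(a)

# Consecutive NAs and a leading NA are handled as well
locf(c(NA, 1, NA, NA, 2))
```

The index trick avoids an explicit loop, so it should stay fast on long vectors.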

score 0 · Answer 2 · answered Jul 08 '11 at 18:03

I would outright remove any features that have far too many missing values to impute and use KNN to impute missing values for the remaining ones.
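To make the idea concrete, here is a rough base-R sketch of that two-step approach. It is an illustration under several assumptions, not a reference implementation: `knn_impute` is a hypothetical helper, the 50% cutoff for dropping columns and `k = 3` are arbitrary defaults, all columns are assumed numeric, neighbours are drawn only from fully observed rows, and distances are plain Euclidean on the unscaled columns (in practice you would probably standardise first).

```r
# Drop columns with too many NAs, then fill the rest from the k nearest
# complete rows. knn_impute() and its defaults are assumptions for this sketch.
knn_impute <- function(X, k = 3, max_na_frac = 0.5) {
  X <- as.matrix(X)
  # Step 1: remove features whose share of missing values exceeds the cutoff
  X <- X[, colMeans(is.na(X)) <= max_na_frac, drop = FALSE]
  complete <- X[stats::complete.cases(X), , drop = FALSE]
  stopifnot(nrow(complete) >= k)
  # Step 2: impute each incomplete row from its k nearest complete rows
  for (i in which(!stats::complete.cases(X))) {
    miss <- is.na(X[i, ])
    if (all(miss)) next  # no observed columns to measure distance on
    # Euclidean distance to every complete row, using observed columns only
    diffs <- sweep(complete[, !miss, drop = FALSE], 2, X[i, !miss])
    nn <- order(sqrt(rowSums(diffs^2)))[seq_len(k)]
    # Fill each missing entry with the mean of the neighbours' values
    X[i, miss] <- colMeans(complete[nn, miss, drop = FALSE])
  }
  X
}

X <- rbind(c(1, 10), c(2, 20), c(3, 30), c(10, 100), c(2.1, NA))
knn_impute(X, k = 2)
```

In this toy matrix the last row is closest to the second and third rows, so its missing entry is filled with the mean of their second-column values.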

user4673


You could also use a bagged tree model for imputation, but that requires more processing power. – Zach Jul 09 '11 at 17:16

score 0 · Answer 3 · answered Jul 09 '11 at 17:17

A combination of "is.na" and "lag" should do what you want, but, as previous commentators have pointed out, this may not be the best method of imputation.
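One way this could look for a plain vector is sketched below. Two caveats: base R's `lag()` is meant for time-series objects and does not shift the values of a plain vector, so the shift is done by hand here, and `fill_with_lag` is a hypothetical name for this example. The loop repeats the shift until nothing more can be filled, which is what lets it cope with runs of consecutive NAs; a leading NA is left as NA.

```r
# Fill NAs from a one-step lagged copy of the vector, repeating until stable.
# fill_with_lag() is a hypothetical helper name for this sketch.
fill_with_lag <- function(x) {
  repeat {
    # Manual lag: shift every value one position to the right
    lagged <- c(NA, head(x, -1))
    # Positions that are missing but have a usable previous value
    fillable <- is.na(x) & !is.na(lagged)
    if (!any(fillable)) break
    x[fillable] <- lagged[fillable]
  }
  x
}

fill_with_lag(c(3, 2, 1, 6, 3, NA, 23, 23, NA))
fill_with_lag(c(1, NA, NA, 4))
```

Each pass fills one more element of every NA run, so the number of passes is bounded by the longest run.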

Zach

