3

I'm using a data frame with many NA values. I can fit a linear model, but afterwards I can't line the fitted values up with the original data, because rows with missing values are dropped and there is no indicator column showing which rows were kept.

Here's a reproducible example:

library(MASS)
dat <- Aids2
# Add NAs to column 3 (diag) at up to 100 random rows
dat[floor(runif(100, min = 1, max = nrow(dat))),3] <- NA
# Create a model
model <- lm(death ~ diag + age, data = dat)
# The lengths differ
length(fitted.values(model))
# 2745
nrow(dat)
# 2843
Zheyuan Li
IJH

4 Answers

7

There are actually three solutions here:

  1. pad NA to fitted values ourselves;
  2. use predict() to compute fitted values;
  3. drop incomplete cases ourselves and pass only complete cases to lm().

Option 1

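## row indices of the incomplete cases (the rows lm() dropped);
## this assumes NAs occur only in model variables, otherwise use model$na.action
id <- attr(na.omit(dat), "na.action")
fitted <- rep(NA, nrow(dat))
fitted[-id] <- model$fitted.values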
nrow(dat)
# 2843
length(fitted)
# 2843
sum(!is.na(fitted))
# 2745

Option 2

## the default NA action for "predict.lm" is "na.pass"
pred <- predict(model, newdata = dat)  ## has to use "newdata = dat" here!
nrow(dat)
# 2843
length(pred)
# 2843
sum(!is.na(pred))
# 2745
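
As a quick sanity check on Option 2, here is a minimal sketch on a small made-up data frame (`toy` and `toy_fit` are illustrative names, not part of the Aids2 example). It suggests that predict() with newdata returns one value per row and agrees with the stored fitted values on the rows the model actually used:

```r
# Toy data (hypothetical): x is missing in rows 3 and 6
toy <- data.frame(
  y = c(1.2, 2.1, 2.9, 4.2, 5.1, 5.8, 7.1, 8.0),
  x = c(1, 2, NA, 4, 5, NA, 7, 8)
)
toy_fit <- lm(y ~ x, data = toy)   # lm() drops rows 3 and 6

# predict() with newdata keeps one entry per row, NA where x is missing
toy_pred <- predict(toy_fit, newdata = toy)

used <- !is.na(toy$x)              # rows that entered the fit
length(toy_pred) == nrow(toy)
all.equal(unname(toy_pred[used]), unname(fitted(toy_fit)))
```

Both checks should come back TRUE, so the predictions can be attached to the original data frame as a column directly.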

Option 3

Alternatively, you might simply pass a data frame without any NA to lm():

complete.dat <- na.omit(dat)
fit <- lm(death ~ diag + age, data = complete.dat)
nrow(complete.dat)
# 2745
length(fit$fitted.values)
# 2745
sum(!is.na(fit$fitted.values))
# 2745

In summary,

  • Option 1 does the "alignment" in a straightforward manner by padding NA, but I think people seldom take this approach;
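  • Option 2 is really simple, but it is more computationally costly, since predict() recomputes the predictions from the data instead of reusing the stored fitted values;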
  • Option 3 is my favourite as it keeps all things simple.
Zheyuan Li
3

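I use a simple for loop. The fitted values are named after the original rows they belong to, so they can be looked up by row name:

for (i in seq_len(nrow(data))) {
  # look up by name; rows dropped from the fit come back as NA
  data$fitted.values[i] <- fit$fitted.values[as.character(i)]
}

"data" is your original data frame and "fit" is the fitted model object (i.e. fit <- lm(y ~ x, data = data)). This assumes the data frame has the default row names 1, 2, ..., n.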
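
The same name-based lookup can also be done without a loop. A minimal sketch on a made-up data frame (`toy2` and `fit2` are illustrative names), assuming the model was fitted on that same data frame so that the names of the fitted values are its row names:

```r
# Toy data (hypothetical): x is missing in rows 2 and 5
toy2 <- data.frame(
  y = c(3.1, 4.9, 7.2, 9.1, 10.8, 13.2),
  x = c(1, NA, 3, 4, NA, 6)
)
fit2 <- lm(y ~ x, data = toy2)

# Indexing a named vector by a name it does not contain yields NA,
# so rows dropped by lm() are filled with NA automatically
toy2$fitted.values <- unname(fitted(fit2)[rownames(toy2)])
toy2
```

Using rownames() rather than the row position also covers data frames whose row names are not simply 1, 2, ..., n.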

izk9
1

My answer is an extension to @ithomps's solution:

for (i in seq_len(nrow(data))) {
  key <- as.character(i)
  # test only row i's sex, and use a real NA rather than the string "NA"
  data$fitted.values.men[i] <- ifelse(data$sex[i] == 1,
    fit.males$fitted.values[key], NA)
  data$fitted.values.women[i] <- ifelse(data$sex[i] == 0,
    fit.females$fitted.values[key], NA)
  data$fitted.values.combined[i] <- fit.combo$fitted.values[key]
}

In my case I ran three models: one for males, one for females, and one for both combined. To make things "more" convenient, males and females are randomly interleaved in my data. I also have missing data in the input to lm(), so I fitted with fit <- lm(y ~ x, data = data, na.action = na.exclude), which makes fitted(fit) and residuals(fit) return NA for the excluded rows.
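
The na.action = na.exclude route mentioned above can be sketched as follows, on a made-up data frame (`toy3` and `fit3` are illustrative names). Note that the stored component fit3$fitted.values still covers only the complete cases; it is the extractor functions fitted() and residuals() that pad the result with NA:

```r
# Toy data (hypothetical): x is missing in rows 3 and 6
toy3 <- data.frame(
  y = c(2.0, 3.9, 6.1, 8.2, 9.8, 12.1),
  x = c(1, 2, NA, 4, 5, NA)
)
fit3 <- lm(y ~ x, data = toy3, na.action = na.exclude)

length(fit3$fitted.values)  # stored component: complete cases only
length(fitted(fit3))        # extractor: padded to nrow(toy3)

# Padded vectors line up with the original rows, so they can be
# attached as columns directly
toy3$fitted <- fitted(fit3)
toy3$resid  <- residuals(fit3)
toy3
```

With this setup no loop or name lookup is needed, provided the values are taken through fitted() and residuals() rather than from the model object's components.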

Hope this helps others.

(I found it pretty hard to formulate my issue/question, glad I found this post!)

0

If you do not want to change the raw data, try this approach; it is really simple.

names(fitted.values(model)) holds the row names of the observations that were actually used in the fit, and we can use them to add a new column:

dat[names(fitted.values(model)), "fitted.values"] <- fitted.values(model)
sum(!is.na(dat[, "fitted.values"]))
# [1] 2745
Matt