I have an application that predicts height from sex and DNA mutations. Because height differs strongly by sex, sex is a variable with a large effect in my prediction (~13). In contrast, I have many DNA mutations (say 500,000 variables, of which maybe 20,000 have an effect), each with a small effect (no more than e.g. 0.5).
When I fit a linear model with a lasso penalty to these data, only 7 variables are selected (on top of sex), and the prediction is quite bad. It's as if the lasso couldn't find the small effects because of the large effect of sex. The solution I found to overcome this problem is to fit a linear model on sex alone and then apply the lasso to the residuals of that model. This solution works very well, but I lack an understanding of why I need to do this (fit the large effect first, then the small effects on the residuals). Any published reference or explanation on this would be really appreciated.
I tried to reproduce this problem with simulations in R. It's not that convincing but it partially shows the problem I'm facing.
### Data simulation
```r
set.seed(1)
N <- 500   # number of observations
M <- 2000  # number of variables
MC <- 100  # number of causal variables
X <- matrix(rnorm(N * M), N)
set <- sample(M, MC)
eff <- rnorm(MC - 1)
# Many small effects and one large effect
y <- X[, set[-1]] %*% eff + X[, set[1]] * 10
# Add some noise
y <- y + rnorm(N, sd = sd(y) / 20)
```
### Prediction with lasso on all variables

```r
library(glmnet)
ind.train <- sample(N, 0.8 * N)
ind.test <- setdiff(1:N, ind.train)
CV <- cv.glmnet(X[ind.train, ], y[ind.train])
mod.lasso <- glmnet(X[ind.train, ], y[ind.train], lambda = CV$lambda.1se)
pred.lasso <- predict(mod.lasso, X[ind.test, ])
plot(pred.lasso, y[ind.test], pch = 20); abline(0, 1, col = "red")
sqrt(mean((pred.lasso - y[ind.test])^2)) # RMSE: 3.1
```
### Prediction with a linear model using only the strong-effect variable

```r
mod.lm <- lm(y ~ X, data = data.frame(y = y[ind.train], X = X[ind.train, set[1]]))
summary(mod.lm)
pred.lm <- predict(mod.lm, data.frame(X = X[, set[1]]))
plot(pred.lm[ind.test], y[ind.test], pch = 20); abline(0, 1, col = "red")
sqrt(mean((pred.lm[ind.test] - y[ind.test])^2)) # RMSE: 9.2
```
### Prediction with lasso on the residuals of the linear model

```r
CV.resid <- cv.glmnet(X[ind.train, ], y[ind.train] - pred.lm[ind.train])
mod.lasso.resid <- glmnet(X[ind.train, ], y[ind.train] - pred.lm[ind.train],
                          lambda = CV.resid$lambda.1se)
pred.lasso2 <- pred.lm[ind.test] + predict(mod.lasso.resid, X[ind.test, ])
plot(pred.lasso2, y[ind.test], pch = 20); abline(0, 1, col = "red")
sqrt(mean((pred.lasso2 - y[ind.test])^2)) # RMSE: 2.5
```
Comments:

- "… `sex` in the model without penalization. In R, with glmnet, there is an argument `penalty.factor` you can use for this purpose. (Set it to 0 for `sex`.) Then tell us how that worked. My guess is that will be better than your method of fitting with lasso to residuals from the first model with `sex` only." – kjetil b halvorsen, Oct 05 '18 at 13:10
- "… `lm(height ~ sex)` first; this gives me residuals, and then I use lasso on these residuals. See the R code I provided for something similar to what I'm doing." – F. Privé, Oct 05 '18 at 17:59
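Following the `penalty.factor` suggestion in the comments, here is a rough sketch of what that could look like on the simulated data above. It reruns the same simulation so the chunk is self-contained; the factor of 0 on the strong column means that variable is never shrunk, which is possible here only because the simulation tells us which column it is (in the real application, the sex column's index is known).

```r
library(glmnet)

# Same simulation as above
set.seed(1)
N <- 500; M <- 2000; MC <- 100
X <- matrix(rnorm(N * M), N)
set <- sample(M, MC)
eff <- rnorm(MC - 1)
y <- X[, set[-1]] %*% eff + X[, set[1]] * 10
y <- y + rnorm(N, sd = sd(y) / 20)
ind.train <- sample(N, 0.8 * N)
ind.test <- setdiff(1:N, ind.train)

# penalty.factor gives a per-variable multiplier on lambda;
# a factor of 0 means that variable is never penalized.
pf <- rep(1, M)
pf[set[1]] <- 0  # leave the strong-effect variable unpenalized
CV.pf <- cv.glmnet(X[ind.train, ], y[ind.train], penalty.factor = pf)
pred.pf <- predict(CV.pf, X[ind.test, ], s = "lambda.1se")
rmse.pf <- sqrt(mean((pred.pf - y[ind.test])^2))
rmse.pf
```

This keeps everything in a single lasso fit instead of the two-stage residual approach, so the small effects are estimated jointly with the unpenalized strong effect rather than after it.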