
I have an application that predicts height based on sex and DNA mutations. As height differs greatly between the sexes, sex is a variable with a strong effect in my prediction (a coefficient of roughly 13). In contrast, I have many DNA mutations (say 500,000 variables, of which maybe 20,000 have an effect), each with a small effect (no more than e.g. 0.5).

When I fit a linear model with a lasso penalty on this, only 7 variables are used in the model (on top of sex), and the prediction is quite bad. It is as if the lasso could not find the small effects because of the large effect of sex. The solution I found to overcome this problem is to fit a linear model based on sex only and then use the lasso on the residuals of that model. This solution works very well, but I lack an understanding of why I need to do this (fit the large effects first, then the small effects on the residuals). Any published reference or explanation on this would be much appreciated.
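In the real application, the two-step procedure looks roughly like this (schematically; dat, height, sex and the mutation matrix G are placeholder names, not my actual code):

# Step 1: fit the large effect alone (dat is a hypothetical data frame)
fit.sex <- lm(height ~ sex, data = dat)
# Step 2: lasso on what sex does not explain (G is the mutation matrix)
fit.resid <- glmnet::cv.glmnet(G, residuals(fit.sex))
# Final prediction = sex model + residual model, e.g.
# predict(fit.sex, new.dat) + predict(fit.resid, new.G, s = "lambda.1se")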

I tried to reproduce this problem with simulations in R. It's not that convincing but it partially shows the problem I'm facing.

### Data simulation
set.seed(1)
N <- 500
M <- 2000
MC <- 100
X <- matrix(rnorm(N * M), N)
set <- sample(M, MC)
eff <- rnorm(MC - 1)
# Many small effects and one large effect
y <- X[, set[-1]] %*% eff + X[, set[1]] * 10
# Add some noise
y <- y + rnorm(N, sd = sd(y) / 20)
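As a rough check of how unbalanced the effects are (an extra check, not part of the comparison below), the one strong variable accounts for about half of the variance of y, while each small effect contributes well under 1%:

# Approximate share of var(y) due to the strong variable alone
drop(var(X[, set[1]] * 10) / var(y))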

### Prediction with lasso on all variables

library(glmnet)
ind.train <- sample(N, 0.8 * N)
ind.test <- setdiff(1:N, ind.train)
CV <- cv.glmnet(X[ind.train, ], y[ind.train])
mod.lasso <- glmnet(X[ind.train, ], y[ind.train], lambda = CV$lambda.1se)
pred.lasso <- predict(mod.lasso, X[ind.test, ])

plot(pred.lasso, y[ind.test], pch = 20); abline(0, 1, col = "red")
sqrt(mean((pred.lasso - y[ind.test])^2))  # RMSE: 3.1
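To mirror the real-data symptom (only 7 variables kept on top of sex), one can also count how many variables this model keeps (an extra check, not in the original output):

beta <- as.matrix(coef(mod.lasso))[-1, 1]  # coefficients without the intercept
sum(beta != 0)                             # number of variables the lasso keeps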

### Prediction with a linear model using only the strong-effect variable

mod.lm <- lm(y ~ X, data = data.frame(y = y[ind.train], X = X[ind.train, set[1]]))
summary(mod.lm)
pred.lm <- predict(mod.lm, data.frame(y = y, X = X[, set[1]]))

plot(pred.lm[ind.test], y[ind.test], pch = 20); abline(0, 1, col = "red")
sqrt(mean((pred.lm[ind.test] - y[ind.test])^2))  # RMSE: 9.2

### Prediction with lasso on residuals of the linear model

CV.resid <- cv.glmnet(X[ind.train, ], y[ind.train] - pred.lm[ind.train])
mod.lasso.resid <- glmnet(X[ind.train, ], y[ind.train] - pred.lm[ind.train], lambda = CV.resid$lambda.1se)
pred.lasso2 <- pred.lm[ind.test] + predict(mod.lasso.resid, X[ind.test, ])

plot(pred.lasso2, y[ind.test], pch = 20); abline(0, 1, col = "red")
sqrt(mean((pred.lasso2 - y[ind.test])^2))  # RMSE: 2.5
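The same check on the residual model shows whether more of the 99 true small effects are picked up once the large effect has been removed (again an extra check on the simulated data):

beta.resid <- as.matrix(coef(mod.lasso.resid))[-1, 1]  # coefficients without the intercept
sum(beta.resid != 0)                                   # variables kept by the residual lasso
sum(which(beta.resid != 0) %in% set[-1])               # how many of them are true small effects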

F. Privé
  • When you say the prediction is quite bad, do you mean compared to a normal linear regression without penalty? 500k variables is a lot; there must be tons of them that are correlated and redundant. How many observations do you have? Maybe you can try a lower L1 penalty or try elastic-net regression instead. And I am not sure what you mean by first fitting a linear model, getting the residuals, and using the lasso on the residuals... – Tom Oct 05 '18 at 12:48
  • You can try to include sex in the model without penalization. In R, with glmnet, there is an argument penalty.factor you can use for this purpose (set it to 0 for sex). Then tell us how that worked. My guess is that it will be better than your method of fitting the lasso to the residuals from a first model with sex only. – kjetil b halvorsen Oct 05 '18 at 13:10
  • An intuitive answer would be: when you're doing regression with lots of variables relative to the number of rows in your data, you need to regularise because it's a safe assumption that most coefficients will be zero, and the lasso helps you pick the important ones out. It does, however, bias all of your coefficients towards zero, which you absolutely don't want for sex. Setting the sex penalty factor to zero, training a linear regressor without penalisation and then lasso on the residuals, or just fitting two different models (seeing as sex isn't continuous anyway, so the data won't be sparse) are all viable options. – gazza89 Oct 05 '18 at 13:24
  • @Tom I fit the model on ~390,000 observations. I tried elastic-net with alpha = 0.1, 0.01 and 0.001, and the lasso gives the best results. Basically, I use lm(height ~ sex) first, which gives me residuals, and then I use the lasso on these residuals. See the R code I provided for something similar to what I'm doing. – F. Privé Oct 05 '18 at 17:59
  • @kjetilbhalvorsen I use a special implementation of lasso that doesn't allow me to set different penalty factors, unfortunately. But I'll try to do that on the example I provided. – F. Privé Oct 05 '18 at 18:00
  • Using no penalty for the large effect in the lasso gives results similar to adjusting first for this large effect. – F. Privé Oct 06 '18 at 06:31
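For reference, the unpenalized-sex approach suggested in the comments can be tried on the simulated data like this (a sketch; the strong variable set[1] plays the role of sex):

pf <- rep(1, M)
pf[set[1]] <- 0  # no penalty on the strong variable, usual penalty elsewhere
CV.pf <- cv.glmnet(X[ind.train, ], y[ind.train], penalty.factor = pf)
pred.pf <- predict(CV.pf, X[ind.test, ], s = "lambda.1se")
sqrt(mean((pred.pf - y[ind.test])^2))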
