
I need some help understanding the relationship between the ranking of the variables produced by the LARS algorithm and the use of OLS to estimate the final model chosen by LARS.

I understand that the LARS algorithm is less greedy than forward stepwise regression because it does not require additional predictors to be orthogonal to the residual and to the already included predictors. But after LARS has ranked the variables and the optimal number of predictors to include has been chosen, we use OLS to estimate the final model. The OLS parameter estimates differ from those that LARS assigned to the predictors, right? So how can I intuitively explain why it is correct to first run LARS and then use OLS on the model selected by LARS?
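
A minimal sketch of the two-stage procedure being described, assuming scikit-learn's LARS implementation and BIC as one possible criterion for choosing how many top-ranked predictors to keep (the data, the criterion and all settings here are purely illustrative):

    import numpy as np
    from sklearn.linear_model import lars_path, LinearRegression

    # Toy data: 3 relevant predictors out of 10.
    rng = np.random.default_rng(0)
    n, p = 200, 10
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:3] = [3.0, -2.0, 1.5]
    y = X @ beta + rng.standard_normal(n)

    # Step 1: LARS gives the order in which the predictors enter the model.
    _, entry_order, lars_coefs = lars_path(X, y, method="lar")

    # Step 2: choose how many of the top-ranked predictors to keep,
    # here by BIC over the nested OLS fits.
    def bic(y, yhat, k):
        rss = np.sum((y - yhat) ** 2)
        return len(y) * np.log(rss / len(y)) + k * np.log(len(y))

    best_k, best_bic = 1, np.inf
    for k in range(1, len(entry_order) + 1):
        cols = entry_order[:k]
        fit = LinearRegression().fit(X[:, cols], y)
        score = bic(y, fit.predict(X[:, cols]), k + 1)  # +1 for the intercept
        if score < best_bic:
            best_k, best_bic = k, score

    # Step 3: refit the chosen subset by OLS. These coefficients are not
    # shrunk towards zero, unlike the LARS path coefficients in `lars_coefs`.
    selected = entry_order[:best_k]
    ols_fit = LinearRegression().fit(X[:, selected], y)
    print("selected predictors:", selected)
    print("OLS coefficients:", ols_fit.coef_)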

Guest
1 Answer


The coefficient estimates from LARS will be shrunk (biased) towards zero, and the intensity of shrinkage might be suboptimal (too harsh) for forecasting.

However, some shrinkage should be good, because there is a trade-off between bias and variance. For example, if the lasso happens to have selected the relevant regressors and only them (which of course is never guaranteed in practice), one can show that a positive, i.e. nonzero, amount of ridge-type shrinkage is optimal*, just as one can show in a basic linear model with no variable selection (see e.g. the answer by Andrew M in the thread "Under exactly what conditions is ridge regression able to provide an improvement over ordinary least squares regression?"). (I do not know whether this can be shown for LARS-type shrinkage, but intuitively I would not expect zero shrinkage to always be optimal.)
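
A small simulation sketch of this point, assuming the model already contains exactly the relevant regressors; the toy setting is arbitrary and only illustrates that the MSE-minimizing ridge penalty is typically positive rather than zero:

    import numpy as np

    # Compare the MSE of the coefficient estimator for OLS (lambda = 0)
    # and ridge (lambda > 0) when the included regressors are exactly the
    # relevant ones.
    rng = np.random.default_rng(1)
    n, p, sigma = 30, 5, 2.0
    beta = np.array([1.0, -1.0, 0.5, -0.5, 0.25])
    lambdas = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]
    mse = {lam: 0.0 for lam in lambdas}
    reps = 2000

    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = X @ beta + sigma * rng.standard_normal(n)
        XtX, Xty = X.T @ X, X.T @ y
        for lam in lambdas:
            b_hat = np.linalg.solve(XtX + lam * np.eye(p), Xty)
            mse[lam] += np.sum((b_hat - beta) ** 2) / reps

    for lam in lambdas:
        print(f"lambda = {lam:4.1f}   estimated MSE = {mse[lam]:.3f}")
    # The minimum is typically attained at some lambda > 0, i.e. plain OLS
    # (zero shrinkage) is not MSE-optimal even with correct variable selection.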

This is what motivates (1) the relaxed lasso (Meinshausen, 2007), which has two shrinkage parameters: a harsher one for variable selection and a softer one for the coefficients of the retained variables; and (2) LARS-OLS, which applies no shrinkage at all to the coefficients of the retained variables.
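
A sketch of the two approaches side by side, assuming scikit-learn's Lasso and arbitrary illustrative penalty values (in practice both penalties of the relaxed lasso would be tuned, e.g. by cross-validation):

    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression

    # Toy data: 4 relevant predictors out of 20.
    rng = np.random.default_rng(2)
    n, p = 200, 20
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:4] = [2.0, -1.5, 1.0, 0.5]
    y = X @ beta + rng.standard_normal(n)

    # Stage 1: a comparatively harsh penalty, used only to select variables.
    selector = Lasso(alpha=0.2).fit(X, y)
    support = np.flatnonzero(selector.coef_)

    # Stage 2a, relaxed-lasso flavour: a softer penalty on the retained variables.
    relaxed = Lasso(alpha=0.02).fit(X[:, support], y)

    # Stage 2b, LARS-OLS / lasso-OLS flavour: no shrinkage on the retained variables.
    ols_refit = LinearRegression().fit(X[:, support], y)

    print("selected columns:      ", support)
    print("stage-1 coefficients:  ", selector.coef_[support])  # heavily shrunk
    print("relaxed coefficients:  ", relaxed.coef_)            # mildly shrunk
    print("OLS refit coefficients:", ols_refit.coef_)          # unshrunk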

*Optimal in the sense that it minimizes the mean squared error of the estimator

Meinshausen, Nicolai. "Relaxed lasso." Computational Statistics & Data Analysis 52.1 (2007): 374-393.

Richard Hardy
  • could you clarify what you mean by "optimal"? In particular, are you referencing a theorem you could point me towards? – user795305 Jun 15 '17 at 14:03
  • 1
    @Ben, thanks for your remark. I have now explained the sense in which the estimator is optimal. I have also included a link to Dave Giles' blog post for a broader introduction to the topic. – Richard Hardy Jun 15 '17 at 14:15
  • While I hope this isn't too nitpicky an objection: it doesn't appear that what you link to supports what you're saying. From my understanding, the blog post is motivation for James-Stein estimation in terms of regression. On the other hand, you're talking about ridge regression. These are different types of shrinkage. To go further, ridge regression depends on a tuning parameter that needs to be selected in a data-driven way. In the presence of tuning, I'm not aware of any theory whatsoever regarding the risk of the resulting estimator. – user795305 Jun 15 '17 at 15:05
  • Indeed, Hastie and Efron recently wrote in ch. 7.3 of Computer Age Statistical Inference that "The main point here is that at present there is no optimality theory for shrinkage estimation. Fisher provided an elegant theory for optimal unbiased estimation. It remains to be seen whether biased estimation can be neatly codified." Please let me know if I'm not thinking about this right! – user795305 Jun 15 '17 at 15:06
  • @Ben, good points. I did not say we know or can accurately estimate what amount of shrinkage is optimal. What I said is that the optimal amount (unknown) is positive. I thought there was a proof in Dave Giles' blog posts somewhere (perhaps this or this). If not there, it should not be too difficult to prove by oneself. – Richard Hardy Jun 15 '17 at 15:54
  • @Ben, I think I have found another source locally at CV. Please take a look. And it looks like James-Stein and ridge are the same thing written differently: "Unified view on shrinkage: what is the relation (if any) between Stein's paradox, ridge regression, and random effects in mixed models?". – Richard Hardy Jun 15 '17 at 16:02
  • LASSO is often viewed as a method to approximate best-subset selection. If you wanted best-subset selection and you did LASSO, then you naturally view LASSO as giving you a suboptimal solution in that the coefficients are shrunk relative to what best-subset would have given you. Following up with OLS on the selected covariates is the simplest way to compensate. – Paul Jun 15 '17 at 16:16
  • @Paul, but to be fair, LASSO is typically tuned (by cross-validation) to optimize forecasting performance, which is not the same as tuning for optimal variable selection (adaptive LASSO could be better there). When tuned for optimal forecasting performance, it is only natural to use the LASSO estimate as is, without going for OLS after LASSO. Otherwise you should also cross-validate accordingly: not only LASSO, but LASSO+OLS (a sketch of this is given after the comments). (This has been mentioned in previous posts as well.) – Richard Hardy Jun 15 '17 at 16:32
  • I agree that it wouldn't be consistent to cross-validate one procedure and then use another as the final model. – Paul Jun 15 '17 at 17:27
  • Thanks for comments. I want to make it clear that I am not talking about the LASSO modification of the LARS, I only use the LARS algorithm to rank the variables and then select the subset of ranked variables to use in the final predictive model according to an information criterion. To be more precise, my question is the following: When LARS has given me the final model (top ranked predictors and how many of them to include), what is the consequence of using OLS to estimate this predictive model instead of using the LARS estimated parameters? I hope this is more clear. – Guest Jun 15 '17 at 17:41
  • @Guest, I think everything above still holds if we replace lasso with LARS. Or is there anything in particular you would like to be addressed? – Richard Hardy Jun 15 '17 at 17:46
  • I am sorry if I am unclear. In particular, I wonder how the third section in your answer is motivated. You say that the LARS may yield too much shrinkage, but some shrinkage might be good. But how is this related to the LARS-OLS? Is it just that the LARS-OLS will "correct" for the too large shrinkage produced by the LARS? But how can we know whether the shrinkage is too large in the first place? – Guest Jun 15 '17 at 17:58
  • @Guest, I am not an expert, but as far as I understand, the shrinkage of LASSO, and thus probably also of LARS, tends to be too harsh on the coefficients of the truly relevant regressors. OLS fixes that but overdoes it, too, since some shrinkage is needed. Why does the LARS shrinkage tend to be too large? I don't know. Perhaps that can be found in the relaxed lasso paper. – Richard Hardy Jun 15 '17 at 18:07
  • @RichardHardy The 1970 paper that you included a link to in your edit was interesting! Also, I think it is important to note that James-Stein and ridge are not the same in a regression setting. – user795305 Jun 15 '17 at 19:04
  • 1
    @Ben, thanks for your input. You probably know more than me about these things. Feel free to edit my answer to fix any remaining factual mistakes. Or add your own answer, if you prefer. – Richard Hardy Jun 15 '17 at 19:23
  • Okay, thanks for talking with me about this stuff! – user795305 Jun 16 '17 at 01:42
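
As a follow-up to the comment above about cross-validating the whole procedure rather than the lasso step alone, here is a minimal sketch, assuming scikit-learn; the function name and the alpha grid are purely illustrative:

    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression
    from sklearn.model_selection import KFold

    def cv_lasso_then_ols(X, y, alphas, n_splits=5, seed=0):
        """Cross-validate lasso selection + OLS refit as one procedure."""
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        cv_mse = {}
        for alpha in alphas:
            fold_errors = []
            for train, test in kf.split(X):
                # Select variables with lasso on the training fold only.
                support = np.flatnonzero(Lasso(alpha=alpha).fit(X[train], y[train]).coef_)
                if support.size == 0:
                    pred = np.full(len(test), y[train].mean())  # empty model: predict the mean
                else:
                    # Refit OLS on the selection, score on the held-out fold.
                    ols = LinearRegression().fit(X[train][:, support], y[train])
                    pred = ols.predict(X[test][:, support])
                fold_errors.append(np.mean((y[test] - pred) ** 2))
            cv_mse[alpha] = np.mean(fold_errors)
        return cv_mse

    # Usage on toy data with an illustrative alpha grid:
    rng = np.random.default_rng(3)
    X = rng.standard_normal((200, 20))
    beta = np.zeros(20)
    beta[:4] = [2.0, -1.5, 1.0, 0.5]
    y = X @ beta + rng.standard_normal(200)
    scores = cv_lasso_then_ols(X, y, alphas=[0.01, 0.05, 0.1, 0.2, 0.5])
    print("best alpha:", min(scores, key=scores.get))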