Depending on the calculation, out-of-sample $R^2$ can be negative. In fact, for LASSO, even in-sample $R^2$ can be negative (again, depending on the calculation).
If you compute $R^2$ by squaring the Pearson correlation between the predictions and the true values, that quantity is bounded below by zero and cannot be negative. However, below I give another common way to express $R^2$, one likely used by your software.
$$
R^2=1-\left(\dfrac{
\overset{N}{\underset{i=1}{\sum}}\left(
y_i-\hat y_i
\right)^2
}{
\overset{N}{\underset{i=1}{\sum}}\left(
y_i-\bar y
\right)^2
}\right)
$$
In the OLS linear regression case, this turns out to be equivalent to squaring the Pearson correlation between the predicted and true values. Also, in the OLS simple linear regression case (just a slope and an intercept), the above equation is equivalent to the squared Pearson correlation between the outcome $y$ and the lone feature $x$. Importantly, the above equation for $R^2$ allows for the usual interpretation as the "proportion of variance explained" by the regression.
All of this is to say that the above equation is a totally reasonable way of writing $R^2$.
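To make the distinction concrete, here is a small sketch with made-up data (not your returns) showing that the squared-correlation version of $R^2$ cannot be negative, while the sum-of-squares version above can be:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: predictions perfectly correlated with y but badly biased
y = rng.normal(size=200)
y_hat = y + 5  # perfect correlation, terrible predictions

# R^2 as the squared Pearson correlation: bounded below by zero
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2  # ≈ 1.0 here

# R^2 as 1 - SS_res / SS_tot (the equation above): can be negative
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2_ss = 1 - ss_res / ss_tot  # strongly negative here

print(r2_corr, r2_ss)
```

The biased predictions track $y$ perfectly, so the correlation-based $R^2$ looks great, yet the sum-of-squares $R^2$ correctly reports that these predictions are far worse than just predicting $\bar y$.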
For such a formula to give a negative number, the numerator of the fraction must exceed the denominator. Digging into the fraction, the numerator is the sum of squared residuals for your model, and the denominator is the sum of squared residuals for a model that predicts $\bar y$ every time, regardless of the feature/covariate values. Such a model is a reasonable naïve baseline, a "must-beat" model: if you want to predict the conditional expected value and know nothing about how the features influence $y$, what better prediction than the mean of $y$ every time?
Consequently, when that formula gives a value below zero, it signals that your predictions are doing worse in terms of square loss (sum of squared residuals) than the baseline, "must-beat" model. Given that you aim to predict financial returns, which are notoriously difficult to predict, poor model performance is not surprising. Looking at your graph, the green line of predictions is far from the blue line of true values, consistent with poor model performance. A useful visualization might be a scatterplot of true versus predicted values; I have another answer where I show plots like this and explain why they can show strong correlation yet make terrible predictions. Depending on the mistakes your model makes, you might be able to calibrate the predictions (such as flipping the sign of the predicted returns if the model consistently gets the sign wrong). That warrants a separate question and answer, though it is discussed to some extent in a question of mine from about a year ago and in a comment there by Stephan Kolassa.
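As a sketch of that sign-flipping idea (simulated data, not your returns; the `r2` helper just implements the equation above), consider predictions that are strongly anti-correlated with the truth:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical returns and a model that consistently gets the sign wrong
y = rng.normal(scale=0.02, size=500)
y_hat = -y + rng.normal(scale=0.005, size=500)  # anti-correlated predictions

def r2(y_true, y_pred):
    """R^2 as 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

print(r2(y, y_hat))   # negative: worse than predicting the mean
print(r2(y, -y_hat))  # positive after flipping the sign
```

Of course, such a recalibration would have to be justified from the training data; discovering it on the test set and then reporting the improved score would be cheating.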
Overall, it seems that your LASSO model simply does a poor job of making predictions. That is disappointing, sure, but you want to catch performance like this before you deploy a model. After all, I would not want to trust my life's savings to an investment plan built on a model that makes such poor predictions, and I will venture a guess that I am not alone in feeling that way!
(There is a subtlety about what the denominator should be in out-of-sample assessments, and I disagree with typical software implementations such as that of sklearn; see the equations below. There is no disagreement for the in-sample case. Fortunately, my way of doing it, which uses the in-sample $\bar y$, and the sklearn way, which uses the out-of-sample $\bar y$, are likely to give similar denominators, since the in-sample and out-of-sample means should be fairly close unless there is data drift (which is not so unusual, but that is a separate issue). So if you get $R^2<0$ one way, you are likely to get $R^2<0$ the other way.)
$$
R^2_{\text{out-of-sample, Dave}}=1-\left(\dfrac{
\overset{N}{\underset{i=1}{\sum}}\left(
y_i-\hat y_i
\right)^2
}{
\overset{N}{\underset{i=1}{\sum}}\left(
y_i-\bar y_{\text{in-sample}}
\right)^2
}\right)
$$
$$
R^2_{\text{out-of-sample, sklearn}}=1-\left(\dfrac{
\overset{N}{\underset{i=1}{\sum}}\left(
y_i-\hat y_i
\right)^2
}{
\overset{N}{\underset{i=1}{\sum}}\left(
y_i-\bar y_{\text{out-of-sample}}
\right)^2
}\right)
$$
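A small sketch of the two versions on simulated train/test data; the names `r2_dave` and `r2_sklearn` just mirror the subscripts above, and the final check confirms that the second version matches sklearn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)

# Hypothetical train/test outcomes and stand-in (poor) predictions
y_train = rng.normal(loc=1.0, size=300)
y_test = rng.normal(loc=1.0, size=100)
y_pred = rng.normal(loc=1.0, size=100)

ss_res = np.sum((y_test - y_pred) ** 2)

# Denominator uses the in-sample mean
r2_dave = 1 - ss_res / np.sum((y_test - y_train.mean()) ** 2)

# Denominator uses the out-of-sample mean (what sklearn does)
r2_sklearn = 1 - ss_res / np.sum((y_test - y_test.mean()) ** 2)

print(r2_dave, r2_sklearn)  # similar, both negative here
assert np.isclose(r2_sklearn, r2_score(y_test, y_pred))
```

Because the in-sample and out-of-sample means are close here, the two versions nearly coincide; they would diverge only under noticeable drift between training and test outcomes.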