Let's first remember what $R^2$ and adjusted $R^2_{adj}$ mean in linear regression.
$$
R^2 = 1 - \frac{\sum_{i=1}^{N} \left(y_i - \hat y_i\right)^2}{\sum_{i=1}^{N} \left(y_i - \bar y\right)^2}
    = 1 - \frac{\left(\sum_{i=1}^{N} \left(y_i - \hat y_i\right)^2\right)\Big/ N}{\left(\sum_{i=1}^{N} \left(y_i - \bar y\right)^2\right)\Big/ N} \\
R^2_{adj} = 1 - \frac{\left(\sum_{i=1}^{N} \left(y_i - \hat y_i\right)^2\right)\Big/ \left(N - p - 1\right)}{\left(\sum_{i=1}^{N} \left(y_i - \bar y\right)^2\right)\Big/ \left(N - 1\right)}
$$
(Above, $p$ is the number of non-intercept elements of the parameter vector, and I am comfortable assuming there is an intercept.)
The $R^2$ compares the square loss of the model to the square loss of a naïve model that always predicts the overall mean of the outcome, $\bar y$, no matter the feature values. The adjusted $R^2_{adj}$ then penalizes the inclusion of many features: all else equal, increasing $p$ shrinks $N-p-1$ and so inflates the numerator of the fraction, which decreases $R^2_{adj}$. The $N-p-1$ comes from the derivation of an unbiased estimator of the error variance, and the $N-1$ comes from the derivation of an unbiased estimator of the total variance.
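As a quick sanity check (this toy example is mine and not from the question), the formulas above line up with what summary() of an lm() fit reports:
set.seed(1)
N_toy <- 50
x1 <- rnorm(N_toy)
x2 <- rnorm(N_toy)
y_toy <- 1 + 2 * x1 - x2 + rnorm(N_toy)   # simulated data purely for illustration
fit <- lm(y_toy ~ x1 + x2)
p_toy <- 2                                # number of non-intercept parameters
sse <- sum((y_toy - fitted(fit))^2)       # square loss of the model
sst <- sum((y_toy - mean(y_toy))^2)       # square loss of the naive mean-only model
1 - sse / sst                                         # equals summary(fit)$r.squared
1 - (sse / (N_toy - p_toy - 1)) / (sst / (N_toy - 1)) # equals summary(fit)$adj.r.squared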
In a neural network, I can get on board with $\left(\sum_{i=1}^{N}\left(y_i-\bar y\right)^2\right)\big/\left(N-1\right)$ being unbiased for the total variance. After all, this is just the usual (unbiased) variance calculation that we knew was unbiased long before deep learning existed. However, I have a tough time coming up with an unbiased estimator for the error variance. If you just count the number of parameters in the neural network, that is (probably) too high to use for $p$ and might even exceed the number of observations, but I am not sure how far below the parameter count $p$ should be. Further, fiddling with cross-validation or hyperparameter tuning will affect this calculation.
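To make the first point concrete, the denominator term is just R's built-in var(), which already divides by $N-1$; a one-line check on arbitrary illustrative data:
y_check <- rbeta(25, 1, 1)  # arbitrary illustrative data
all.equal(sum((y_check - mean(y_check))^2) / (length(y_check) - 1), var(y_check))  # TRUE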
Therefore, I wouldn't try to write a closed-form equation for something $R^2_{adj}$-like for your LSTM network. I would go back to the original idea of $R^2$ and compare model performance to the performance of a naïve model. Without the time component, the naïve model would be one that always predicts the in-sample mean. With the time component, it seems philosophically consistent to use the historical mean: at each forecast origin, the naïve model predicts the mean of all data points observed prior to that time. The square loss of this naïve model then serves as the denominator, while the numerator is the square loss of your LSTM model, calculated as usual. This is consistent with how Campbell & Thompson (2008) approach $R^2$ in time-series forecasting, discussed here.
In the simulation below, I have a historical time series of ten points. Then I predict the next ten.
library(ggplot2)
set.seed(2024)
N <- 10
yt <- rbeta(N + 10, 1, 1)                       # True values for the 20-sample time series
yf1 <- rnorm(10, 1/2, 0.1)                      # Random forecasts of the final ten values
yf2 <- yt[(N + 1):(N + 10)] + rnorm(10, 0, 0.1) # Accurate forecasts: true values plus small noise
num1 <- num2 <- den <- rep(NA, 10)
for (i in 1:10){
  yt_history <- yt[1:(N + i - 1)]            # All values prior to the current period
  historical_mean <- mean(yt_history)        # Calculate the historical mean
  den[i]  <- (yt[N + i] - historical_mean)^2 # Squared error when the historical mean is predicted
  num1[i] <- (yt[N + i] - yf1[i])^2          # Squared error from predictions yf1
  num2[i] <- (yt[N + i] - yf2[i])^2          # Squared error from predictions yf2
}
print(1 - sum(num1)/sum(den)) # With random predictions, this is worse than
                              # always predicting the historical mean
print(1 - sum(num2)/sum(den)) # With predictions deviating from the true values
                              # only slightly, this calculation shows the
                              # predictions to be quite good.
d0 <- data.frame(
  Time = seq(1, 10, 1),
  Value = yt[(N + 1):(N + 10)],
  Series = "True Time Series"
)
d1 <- data.frame(
  Time = seq(1, 10, 1),
  Value = yf1,
  Series = "Random Predictions"
)
d2 <- data.frame(
  Time = seq(1, 10, 1),
  Value = yf2,
  Series = "Good Predictions"
)
d <- rbind(d0, d1, d2)
ggplot(d, aes(x = Time, y = Value, col = Series)) +
  geom_line() +
  scale_color_manual(values = c("blue", "red", "black"))

From the plot, we can see that the red predictions tend to be far away from the true values in black, and they give an "$R^2$" below zero (I get $-0.1605572$). The blue predictions, however, are rather close to the true values in black, and I get a rather high "$R^2$" of $0.9579874$.
Therefore, this is how I would do the calculation. It follows the same philosophy as adjusted $R^2$ in that it penalizes model complexity that does not lead to generalizable performance (that is what the out-of-sample forecasting checks) and it compares the model to some kind of naïve benchmark, yet it does not require answering the difficult question of what the degrees of freedom are in an LSTM network, for which linear regression has an easy formula but neural networks do not.
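If it helps, here is a small wrapper that packages the calculation from the simulation; it is my own sketch (the name oos_r2 is made up, not from any package), computing the out-of-sample "$R^2$" against an expanding historical-mean benchmark.
oos_r2 <- function(y_history, y_true, y_pred) {
  stopifnot(length(y_true) == length(y_pred))
  num <- den <- rep(NA_real_, length(y_true))
  for (i in seq_along(y_true)) {
    hist_mean <- mean(c(y_history, y_true[seq_len(i - 1)]))  # expanding historical mean
    den[i] <- (y_true[i] - hist_mean)^2                      # squared error of the naive benchmark
    num[i] <- (y_true[i] - y_pred[i])^2                      # squared error of the forecast
  }
  1 - sum(num) / sum(den)
}
oos_r2(yt[1:N], yt[(N + 1):(N + 10)], yf1)  # reproduces the first value printed above
oos_r2(yt[1:N], yt[(N + 1):(N + 10)], yf2)  # reproduces the second value printed above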
REFERENCE
Campbell, John Y., and Samuel B. Thompson. "Predicting excess stock returns out of sample: Can anything beat the historical average?" The Review of Financial Studies 21.4 (2008): 1509–1531.