Let's first remember what $R^2$ and adjusted $R^2_{adj}$ mean in linear regression.
$$
R^2 = 1 - \frac{\sum_{i=1}^{N} \left(y_i - \hat y_i\right)^2}{\sum_{i=1}^{N} \left(y_i - \bar y\right)^2}
    = 1 - \frac{\left(\sum_{i=1}^{N} \left(y_i - \hat y_i\right)^2\right)\Big/ N}{\left(\sum_{i=1}^{N} \left(y_i - \bar y\right)^2\right)\Big/ N} \\
R^2_{adj} = 1 - \frac{\left(\sum_{i=1}^{N} \left(y_i - \hat y_i\right)^2\right)\Big/ \left(N - p - 1\right)}{\left(\sum_{i=1}^{N} \left(y_i - \bar y\right)^2\right)\Big/ \left(N - 1\right)}
$$
(Above, $p$ is the number of non-intercept elements of the parameter vector, and I am comfortable assuming there is an intercept.)
The $R^2$ compares the square loss of the model to the square loss of a naïve model that always predicts the overall mean of the outcome, $\bar y$, no matter the feature values. The adjusted $R^2_{adj}$ then penalizes the inclusion of many features: all else equal, increasing $p$ shrinks $N-p-1$ and so inflates the numerator of the fraction, which decreases $R^2_{adj}$. The $N-p-1$ comes from the derivation of an unbiased estimator of the error variance, and the $N-1$ comes from the derivation of an unbiased estimator of the total variance.
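As a quick sanity check (this toy example is mine and not from the question), the formulas above line up with what summary() of an lm() fit reports:
set.seed(1)
N_toy <- 50
x1 <- rnorm(N_toy)
x2 <- rnorm(N_toy)
y_toy <- 1 + 2 * x1 - x2 + rnorm(N_toy)   # simulated data purely for illustration
fit <- lm(y_toy ~ x1 + x2)
p_toy <- 2                                # number of non-intercept parameters
sse <- sum((y_toy - fitted(fit))^2)       # square loss of the model
sst <- sum((y_toy - mean(y_toy))^2)       # square loss of the naive mean-only model
1 - sse / sst                                         # equals summary(fit)$r.squared
1 - (sse / (N_toy - p_toy - 1)) / (sst / (N_toy - 1)) # equals summary(fit)$adj.r.squared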
In a neural network, I can get on board with $\left(\sum_{i=1}^{N}\left(y_i-\bar y\right)^2\right)\big/\left(N-1\right)$ being unbiased for the total variance. After all, this is just the usual (unbiased) variance calculation that we knew was unbiased long before deep learning existed. However, I have a tough time coming up with an unbiased estimator for the error variance. If you just count the number of parameters in the neural network, that is (probably) too high to use for $p$ and might even exceed the number of observations, but I am not sure how far below the parameter count $p$ should be. Further, fiddling with cross-validation or hyperparameter tuning will affect this calculation.
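To make the first point concrete, the denominator term is just R's built-in var(), which already divides by $N-1$; a one-line check on arbitrary illustrative data:
y_check <- rbeta(25, 1, 1)  # arbitrary illustrative data
all.equal(sum((y_check - mean(y_check))^2) / (length(y_check) - 1), var(y_check))  # TRUE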
Therefore, I wouldn't try to write a closed-form equation for something $R^2_{adj}$-like for your LSTM network. I would go back to the original idea of $R^2$ and compare model performance to the performance of a naïve model. Without the time component, the naïve model would be one that always predicts the in-sample mean. With the time component, it seems philosophically consistent to use the historical mean: at each forecast origin, the naïve model predicts the mean of all data points observed prior to that time. The square loss of this naïve model then serves as the denominator, while the numerator is the square loss of your LSTM model, calculated as usual. This is consistent with how Campbell & Thompson (2008) approach $R^2$ in time-series forecasting, discussed here.
In the simulation below, I have a historical time series of ten points. Then I predict the next ten.
library(ggplot2)
set.seed(2024)
N <- 10
yt <- rbeta(N + 10, 1, 1)                       # True values for the 20-sample time series
yf1 <- rnorm(10, 1/2, 0.1)                      # Random forecasts of the final ten values
yf2 <- yt[(N + 1):(N + 10)] + rnorm(10, 0, 0.1) # Accurate forecasts: true values plus small noise
num1 <- num2 <- den <- rep(NA, 10)
for (i in 1:10){
  yt_history <- yt[1:(N + i - 1)]            # All values prior to the current period
  historical_mean <- mean(yt_history)        # Calculate the historical mean
  den[i]  <- (yt[N + i] - historical_mean)^2 # Squared error when the historical mean is predicted
  num1[i] <- (yt[N + i] - yf1[i])^2          # Squared error from predictions yf1
  num2[i] <- (yt[N + i] - yf2[i])^2          # Squared error from predictions yf2
}
print(1 - sum(num1)/sum(den)) # With random predictions, this is worse than
                              # always predicting the historical mean
print(1 - sum(num2)/sum(den)) # With predictions deviating from the true values
                              # only slightly, this calculation shows the
                              # predictions to be quite good.
d0 <- data.frame(
  Time = seq(1, 10, 1),
  Value = yt[(N + 1):(N + 10)],
  Series = "True Time Series"
)
d1 <- data.frame(
  Time = seq(1, 10, 1),
  Value = yf1,
  Series = "Random Predictions"
)
d2 <- data.frame(
  Time = seq(1, 10, 1),
  Value = yf2,
  Series = "Good Predictions"
)
d <- rbind(d0, d1, d2)
ggplot(d, aes(x = Time, y = Value, col = Series)) +
  geom_line() +
  scale_color_manual(values = c("blue", "red", "black"))

From the plot, we can see that the red predictions tend to be far away from the true values in black, and they give an "$R^2$" below zero (I get $-0.1605572$). The blue predictions, however, are rather close to the true values in black, and I get a rather high "$R^2$" of $0.9579874$.
Therefore, this is how I would do the calculation. It follows the same philosophy as adjusted $R^2$ in that it penalizes model complexity that does not lead to generalizable performance (that is what the out-of-sample forecasting checks) and it compares the model to some kind of naïve benchmark, yet it does not require answering the difficult question of what the degrees of freedom are in an LSTM network, for which linear regression has an easy formula but neural networks do not.
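If it helps, here is a small wrapper that packages the calculation from the simulation; it is my own sketch (the name oos_r2 is made up, not from any package), computing the out-of-sample "$R^2$" against an expanding historical-mean benchmark.
oos_r2 <- function(y_history, y_true, y_pred) {
  stopifnot(length(y_true) == length(y_pred))
  num <- den <- rep(NA_real_, length(y_true))
  for (i in seq_along(y_true)) {
    hist_mean <- mean(c(y_history, y_true[seq_len(i - 1)]))  # expanding historical mean
    den[i] <- (y_true[i] - hist_mean)^2                      # squared error of the naive benchmark
    num[i] <- (y_true[i] - y_pred[i])^2                      # squared error of the forecast
  }
  1 - sum(num) / sum(den)
}
oos_r2(yt[1:N], yt[(N + 1):(N + 10)], yf1)  # reproduces the first value printed above
oos_r2(yt[1:N], yt[(N + 1):(N + 10)], yf2)  # reproduces the second value printed above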
REFERENCE
Campbell, John Y., and Samuel B. Thompson. "Predicting excess stock returns out of sample: Can anything beat the historical average?" The Review of Financial Studies 21.4 (2008): 1509–1531.