Help with Classification model for S&P500

Question

I have started a project in order to develop my coding skills, where I am predicting next month's S&P500 return direction based on some macroeconomic and financial variables. These datasets have monthly data.

I have downloaded the data, cleaned it, and applied some transformations: for financial variables I took the percentage change, and for macro variables I took either the difference or the percentage change. I added a screenshot of the distributions from my dataset.

My target is built from the S&P 500 returns (log differences), split into two categories: I classified negative returns as 0 and positive returns as 1. I also created a weight variable which is the absolute value of the return, which I would use as input for xgboost. It is my understanding that this weight variable will penalise my algorithm more for wrongly predicting these "large" movements.

Is there a common approach to this kind of data transformation? For example, does the data need to be standardised?

Asking five questions in the same post is not really encouraged here. Can you edit to reduce the number and then ask another question when the first one has been answered? — mdewey, Nov 18 '23 at 15:56
I’ve given some broad advice in my answer, but I agree that this needs to be more focused to fit on here. My answer is really more of an extended comment. — Dave, Nov 18 '23 at 17:30
@mdewey sorry mate, didn't know that. I edited already, cheers. — user199, Nov 18 '23 at 20:11
trying to predict SPX is probably the worst learning project you could come up with. it's basically the same as picking perpetual motion engine for the first Physics project. pick something that is actually possible to predict. — Aksakal, Nov 19 '23 at 13:40

Dave · Accepted Answer · 2023-11-19T03:50:56.010

There’s a huge issue in what you’re doing that needs to have attention drawn to it.

“I also created a weight variable which is the absolute value of the return, which I would use as input for xgboost.”

If you have the information to calculate this, you already have the information to calculate the sign, as you don’t get to know any of the needed information until you observe the stock price. That is, you can’t predict tomorrow’s sign until you know that return, at which point, you know the sign of the return.

(From the comments, this seems to be used to weight the errors, rather than being an input feature that can’t be available when it comes time to make truly new predictions. I still have some issues with predicting signs of returns, but this is a clever way to partially allay some concerns of mine.)
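To make the distinction concrete, here is a minimal sketch of how absolute returns can act as error weights rather than as a feature. The return values and predicted probabilities are made up for illustration, and `weighted_log_loss` is a hypothetical helper written for this example, not an XGBoost function. In practice, the scikit-learn interface of XGBoost accepts such weights through the `sample_weight` argument of `fit`.

```python
import math

# Hypothetical monthly log returns (illustrative numbers, not real S&P 500 data).
log_returns = [0.021, -0.004, 0.013, -0.087, 0.006]

# Labels: 1 for a positive return, 0 otherwise.
labels = [1 if r > 0 else 0 for r in log_returns]

# Weights: absolute return, so large moves count more in the loss.
weights = [abs(r) for r in log_returns]


def weighted_log_loss(y_true, p_pred, sample_weight):
    """Weighted average of the per-observation binary log loss."""
    total = 0.0
    for y, p, w in zip(y_true, p_pred, sample_weight):
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(sample_weight)


# Two hypothetical sets of predicted P(up); each misses exactly one month.
misses_small_move = [0.7, 0.7, 0.7, 0.3, 0.7]  # wrong only on the -0.4% month
misses_large_move = [0.7, 0.3, 0.7, 0.7, 0.7]  # wrong only on the -8.7% month

loss_small = weighted_log_loss(labels, misses_small_move, weights)
loss_large = weighted_log_loss(labels, misses_large_move, weights)

print(f"loss when the small move is missed: {loss_small:.4f}")
print(f"loss when the large move is missed: {loss_large:.4f}")
```

Both sets of predictions have the same accuracy and the same unweighted log loss, but the weighted loss is larger when the big move is missed. The weights are only needed to fit and score the model on past data, so nothing about the future return has to be known at prediction time.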

The next point I’ll make is that ROCAUC has advantages over accuracy, but ROCAUC only deals with ranking. ROCAUC does not concern itself at all with how well the predicted probability values align with the reality of how often gains or losses really do happen. For that, common metrics are log loss and Brier score, each of which can be normalized as McFadden and Efron $R^2$, respectively.
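As a rough illustration of the difference, the sketch below compares two hypothetical sets of predicted probabilities that rank the observations identically, so they share the same ROCAUC, but differ in how far the probabilities sit from 0.5. The data are invented, and the reference model for both $R^2$ values is assumed to be "always predict the observed base rate", which is a common but not the only possible choice.

```python
import math


def roc_auc(y_true, p_pred):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


def log_loss(y_true, p_pred):
    """Mean negative log-likelihood of the binary outcomes."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / len(y_true)


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def mcfadden_r2(y_true, p_pred):
    """1 - log loss of the model / log loss of the base-rate model."""
    base = [sum(y_true) / len(y_true)] * len(y_true)
    return 1 - log_loss(y_true, p_pred) / log_loss(y_true, base)


def efron_r2(y_true, p_pred):
    """1 - Brier score of the model / Brier score of the base-rate model."""
    base = [sum(y_true) / len(y_true)] * len(y_true)
    return 1 - brier_score(y_true, p_pred) / brier_score(y_true, base)


# Hypothetical outcomes (1 = gain, 0 = loss) and predicted P(gain).
y = [1, 0, 1, 1, 0, 1]
sharp = [0.8, 0.3, 0.7, 0.6, 0.4, 0.9]
squashed = [0.5 + (p - 0.5) / 100 for p in sharp]  # same ranking, all near 0.5

for name, p in [("sharp", sharp), ("squashed", squashed)]:
    print(name, roc_auc(y, p), round(brier_score(y, p), 4),
          round(mcfadden_r2(y, p), 4), round(efron_r2(y, p), 4))
```

The ranking metric cannot tell the two apart, while log loss, Brier score, and the two normalized versions can.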

Finally, I have seen others try to predict the signs of financial returns, seeming to take it as self-evident that being able to do so must mean trading riches, and seeing sign predictions instead of return predictions as a way around markets being (mostly) efficient. From my perspective, you lose an enormous amount of valuable information by discarding the magnitude and only looking at the sign. For instance, when you make a trade based on the predicted sign, you incur a trading fee. If the return magnitudes are small, even if you correctly predict the sign, you could wind up losing money because the trading fees outweigh your returns (strong accuracy/ROCAUC/Brier/etc., yet you lose money). Moreover, you will make mistakes in your sign predictions, and a run of gains due to good sign predictions can be offset by one big mistake. Your goal isn’t even just to make money: you have to beat simple benchmarks like an S&P 500 index fund, and even if your returns are comparable, you have to consider the volatility in those returns and whether it is worth incurring the uncertainty of investing based on your model’s predictions.
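Both failure modes can be shown with a few lines of arithmetic. This is only a sketch under assumed numbers: the returns, the 0.5% fee per position change, and the "hold the index when up is predicted, otherwise hold cash" rule in the hypothetical `strategy_return` helper are all illustrative choices, not a realistic cost model.

```python
def strategy_return(returns, predicted_up, fee_per_trade):
    """Compound return of holding the index when 'up' is predicted, else cash.

    Assumes the strategy starts out of the market and pays a proportional
    fee each time the position changes.
    """
    wealth = 1.0
    in_market = False
    for r, up in zip(returns, predicted_up):
        if up != in_market:
            wealth *= 1 - fee_per_trade
            in_market = up
        if in_market:
            wealth *= 1 + r
    return wealth - 1


# Case 1: every sign is predicted correctly, but the moves are small.
small_returns = [0.004, -0.003, 0.002, -0.005, 0.003, -0.002]
perfect_signs = [r > 0 for r in small_returns]
gross = strategy_return(small_returns, perfect_signs, fee_per_trade=0.0)
net = strategy_return(small_returns, perfect_signs, fee_per_trade=0.005)

# Case 2: four correct "up" calls, then one wrong call on a large drop.
returns_with_crash = [0.02, 0.015, 0.01, 0.02, -0.09]
always_up = [True] * len(returns_with_crash)
one_mistake = strategy_return(returns_with_crash, always_up, fee_per_trade=0.0)

print(f"perfect signs, no fees:   {gross:+.4f}")
print(f"perfect signs, with fees: {net:+.4f}")
print(f"80% accuracy, one crash:  {one_mistake:+.4f}")
```

In the first case the sign accuracy is 100% and the strategy still loses money once fees are charged; in the second, 80% accuracy is not enough to survive a single large miss.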

Whether or not this will reliably work is not clear to me, but, philosophically, I think the right machine learning approach to investing comes from reinforcement learning: use the machine learning to create a strategy (the RL “policy”) that maximizes some kind of financial criterion like expected return or Sharpe ratio. This would combine both predictive modeling and what to do with the predictions. When you just predict the sign, you deny the trader such important information about what to do with that prediction, and naïvely buying/holding when signs are predicted to be positive and selling/staying out when signs are predicted to be negative is not necessarily a good strategy.
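This is not reinforcement learning, but as a small sketch of what "evaluate the strategy on a financial criterion" can mean, the code below scores the naïve long-or-flat rule by a per-period Sharpe ratio instead of by classification accuracy. The returns and sign predictions are invented, the risk-free rate is assumed to be zero, and `sharpe_ratio` is a hypothetical helper that does not annualize.

```python
import statistics


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation (per period)."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)


# Hypothetical monthly index returns and predicted signs (True = "up").
index_returns = [0.020, -0.010, 0.015, -0.030, 0.025, 0.010, -0.005, 0.030]
predicted_up = [True, True, True, False, True, False, False, True]

# Naive policy: earn the index return when "up" is predicted, otherwise stay in cash.
strategy_returns = [r if up else 0.0 for r, up in zip(index_returns, predicted_up)]

print("buy and hold Sharpe:", round(sharpe_ratio(index_returns), 3))
print("naive policy Sharpe:", round(sharpe_ratio(strategy_returns), 3))
```

A policy learned to maximize a criterion like this would be free to decide how much to hold, not just whether the predicted sign is positive.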

Thanks for replying. I am also using Brier score, although I haven't tried log loss; I will give it a go. Also, I completely agree that predicting the sign doesn't necessarily lead you to make a good trade once you take into account transaction costs, slippage costs, etc. At the end of the day, this exercise is mostly to develop some skills. The way I tried to address the magnitude of the returns was by passing the weights to the model, so it would incur a higher penalty for wrongly predicting these large movements. The only problem is that some of these large moves were caused by outliers. — user199, Nov 18 '23 at 20:14
@user199 What is an “outlier” in your mind? To me, that sounds like a return that will heavily influence your profits. — Dave, Nov 19 '23 at 03:58
In my mind, outliers are observations that are abnormal in my dataset and, in my case, infrequent. So my model shouldn't be learning from these points, as they are very infrequent cases. In my case I have covid, the financial crisis and the dot-com bubble. I used RobustScaler as it is a bit more robust to outliers. However, not sure if that is the issue, but I have been getting really bad ROC AUC and Brier score results. — user199, Nov 21 '23 at 16:05
@user199 Why don’t you want to learn from these periods? If you can nail the times when weirdness happens, that seems like a way to get rich! Don’t we all wish we invested like they did in The Big Short? — Dave, Nov 21 '23 at 16:12
Yes, but it is the hardest possible thing to predict, haha. I will leave them there so my model can learn from them. One last thing: do you have any opinion on TimeSeriesSplit from scikit-learn versus walk-forward validation? If I use walk-forward validation, I can't use ROC AUC as a measure, as I am trying to predict only the next month's return, while with TimeSeriesSplit you would predict the next k months... — user199, Nov 23 '23 at 17:34
@user199 Those sound like good questions to post on their own. — Dave, Nov 23 '23 at 18:24
