7

I have trained a couple of models that I'm experimenting with: one is Logistic Regression and the other Random Forest. I've got tens of thousands of samples in my dataset (which has 4 features), and I've experimented with how many samples yield the best out-of-sample test accuracy. I have done 10-fold cross-validation and some grid-search optimisation of hyperparameters ... and I'm consistently(**) getting about 82% accuracy when predicting on test data. I am splitting my dataset 70:30, training on the 70% and then testing on the unseen 30%. Both models give me roughly 82% accuracy on the test data. I was thinking this was a good result, and that because k-fold cross-validation is giving me a nice accuracy, I am not overfitting or underfitting. But I must be ...

... when I try predicting on new data samples captured very soon after I train the model, I am getting nowhere near 82% accuracy. In fact, I'm getting less than a 40% success rate when I compare my model's predictions with the outcomes that actually transpire.

So I guess my model does not generalise well. Where can I go from here? I would like first of all to confirm exactly what the problem is. Is the 82% accuracy misleading? How can my live results be so much worse? Could it be that the 4 features are simply not good enough? In that case, how can I get 82% accuracy in testing? Are there tests I can run on the model(s) to gain insights for further work?

(** I retrain the model quite often as new data arrives in real time)
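
For concreteness, my setup looks roughly like the following (a simplified sketch using scikit-learn; the placeholder data, the parameter grids, and the exact calls are illustrative rather than my real code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# placeholder data standing in for my real dataset (tens of thousands of rows, 4 features)
X = np.random.rand(20000, 4)
y = np.random.randint(0, 2, size=20000)

# 70:30 split, kept in time order (no shuffling)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)

# grid search with 10-fold cross-validation on the training portion only
rf = GridSearchCV(RandomForestClassifier(),
                  {"n_estimators": [100, 300], "max_depth": [3, 5, None]}, cv=10)
lr = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=10)
rf.fit(X_train, y_train)
lr.fit(X_train, y_train)

# on my real data both models report roughly 0.82 here
print(rf.score(X_test, y_test), lr.score(X_test, y_test))
```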

brownie74
  • If your training data is from a different joint distribution than the data to which you apply the model, then nothing will work – Aksakal Nov 23 '21 at 18:03
  • @Aksakal Do you mean that if the standard deviation and mean of the out-of-sample data are different from the training stats, then it won't work? This is stationarity, right? My data is stationary, or should be. Maybe I should do real-time checks to make sure the standard deviation and mean of the out-of-sample data are the same as in the training data, and only use the model to predict on the out-of-sample data if that condition holds. Does that make sense? – brownie74 Nov 23 '21 at 18:35
  • I mean it beyond the mean and variance. Those two fully define a normal distribution, but your dataset may come from any distribution, and we usually don't know which one. So it's more of a general statement. Imagine you're building a blood-oxygen meter app for the Apple Watch, and you got the data to train on from a USA Olympic team summer camp, because the measurements were easily available. Will this app generalize to ordinary watch users? It may not work for obese people, who are probably 80% of its users. It may not work for children or the elderly, or for people from South Africa, etc. – Aksakal Nov 23 '21 at 18:40
  • @Aksakal Right, so is there some way I can check how close two distributions are? I think a t-test would do, wouldn't it? Can I test whether my out-of-sample data comes from the same distribution as my training data at a certain confidence level (see the sketch after these comments)? Or is there some other way? – brownie74 Nov 23 '21 at 18:42
  • There is no universal test. If somehow you know that your data comes from a parametric distribution, such as the normal, then you can estimate the parameters on the training and forecast samples and compare. In the general case you're out of luck, and you will often end up simply comparing the forecast errors vs. the model (training) errors, then drawing a conclusion based on the difference – Aksakal Nov 23 '21 at 19:00
  • @Aksakal I cannot really wait for enough forecast samples to be able to estimate a distribution. It will take too long. But I am thinking that what I can do is work out which measurable factors impact the shape of the distribution and detect changes in those factors. If those factors have recently changed, then I can bet that the forecast distribution, as it unfolds, would have a different shape than the test distribution, maybe. And then I can choose not to forecast until normal behaviour is resumed. Maybe. – brownie74 Nov 24 '21 at 07:40
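
A sketch of the kind of per-feature check discussed in the comments above, using a two-sample Kolmogorov–Smirnov test from SciPy (the significance level and window size are illustrative, and this only compares the marginal distributions of each feature, not the joint distribution):

```python
from scipy.stats import ks_2samp

def live_data_resembles_training(train_X, live_X, alpha=0.01):
    """Return False if any feature's live distribution differs from its training
    distribution according to a two-sample KS test at significance level alpha."""
    for j in range(train_X.shape[1]):
        _, p_value = ks_2samp(train_X[:, j], live_X[:, j])
        if p_value < alpha:
            return False
    return True

# usage idea: only trust the model while a recent window of live samples
# (e.g. the last 100) still looks like the training data
# if live_data_resembles_training(X_train, recent_live_X):
#     predictions = model.predict(recent_live_X)
```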

3 Answers

12

It's hard to say without digging deeply into your model and your data. However, it seems like you have been doing a lot of cross-validation, model tuning, cross-validation, model tuning and so forth. That, together with bad out-of-sample performance, suggests that you are overfitting to your test set. That is harder than overfitting in-sample (which is indeed easy), but it is quite possible to do. Essentially, if this is the problem, then your repeated model-tuning cycles have simply fitted the model to the idiosyncrasies of the full dataset.

As to what to do now: you should dig into your data. Did anything change drastically between the training and the testing data? Are there any strong predictors in the new data which did not show up as strongly in the training data? Stuff like that. But remember that the more you tweak your model, the more likely you are to overfit, so proceed with caution.

Incidentally, you should be able to get at least 50% accuracy by always predicting the majority class in your holdout dataset, assuming you can identify this class beforehand. Thus, an accuracy of only 40% is a big red flag. It looks like something has changed in a major way. (Also, this simple benchmark is one reason why accuracy is not a good evaluation measure.)
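
For instance, that baseline can be computed directly with a majority-class dummy model (a sketch assuming the scikit-learn setup described in the question; `X_train`, `y_train`, `X_test`, `y_test` stand for the 70:30 split):

```python
from sklearn.dummy import DummyClassifier

# always predicts the most frequent class seen in the training data;
# any useful model should comfortably beat this on the held-out 30%
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("majority-class accuracy:", baseline.score(X_test, y_test))
```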

Stephan Kolassa
  • The problem is not in the testing. I gather the data and split 70/30 ... I train on the 70%, then I test on the 30% ... I get good results, even with cross-validation. The problem comes when I deploy the model on out-of-sample data. When I measure performance there, I get much worse performance than in in-sample testing (the 30% test). Somehow I need to make sure the out-of-sample distribution is consistent with the model before using the model to make predictions, I think. – brownie74 Nov 23 '21 at 19:42
  • Yes, it does sound like there is a qualitative difference between your training/testing and validation set. If this is time-ordered, have you tried looking at a different split in time? If it isn't, have you looked at different ways of splitting it up? – Stephan Kolassa Nov 24 '21 at 06:40
  • @Stephan I have done k-fold (10 folds) and also experimented a bit with 70/30, 80/20, and 90/10 splits ... and found the 70/30 split to be best. I have started at various points in historic time and built my model from there, to eliminate conditions specific to one point in time. This did not impact accuracy much. – brownie74 Nov 24 '21 at 07:36
0

I second Stephan's answer that the likely culprit is overfitting the entire dataset. That said, another thing to validate is that there are no differences between data processing pipelines in your training vs. production code. E.g. are you normalizing the features before training? If so, do you record the means and standard deviations and apply the same normalization to live data?
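
For illustration, something along these lines (a sketch assuming scikit-learn and joblib; `model` and `X_live` are placeholders for your classifier and incoming live samples):

```python
import joblib
from sklearn.preprocessing import StandardScaler

# training time: fit the scaler on the training split only, then persist it
scaler = StandardScaler().fit(X_train)
model.fit(scaler.transform(X_train), y_train)
joblib.dump(scaler, "scaler.joblib")

# serving time: reload and reuse the *same* means/standard deviations on live data
scaler = joblib.load("scaler.joblib")
live_predictions = model.predict(scaler.transform(X_live))
```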

  • @andersource I do exactly as you say regarding the means/stdevs ... I calculate them while building the model, store them, and use them when processing the first live sample. Then, after processing one sample, I rebuild the model and calculate a new mean/stdev. Then it goes to sleep until the next sample arrives, in about 10 minutes. I have found it best to rebuild the model as often as possible (better in terms of accuracy). Does this sound good to you? – brownie74 Nov 24 '21 at 07:33
  • What do you mean by overfitting the entire dataset? How can that be when my training accuracy is 82% and my testing accuracy is also 82%? If I were overfitting, wouldn't my testing accuracy be much lower than my training accuracy? Just to reiterate, the problem comes when I do live testing on a sample whose label is unknown. My training/testing is done on historic data where I know what actually happened and can calculate prediction accuracy (70/30). This is standard stuff, I think. – brownie74 Nov 24 '21 at 07:33
  • @brownie74 The process you are describing sounds good. Are there other preprocessing steps, e.g. one-hot encoding or other transformations? Also, are you calculating the normalization statistics on the full dataset or just on the training set? If you're using the full dataset, that might be a small source of leakage, albeit probably not that impactful. – andersource Nov 24 '21 at 08:00
  • I calculate the mean/stdev on the training set only, for the reasons you state (leakage). – brownie74 Nov 24 '21 at 08:02
  • How much new data arrives every 10 minutes? Also, what happens if you use an extremely basic model (e.g. Naïve Bayes)? Does it perform similarly on the training (+ test) data and on the live data? Another question: are there any features that would be in a different domain in the new data, e.g. time or a serial number, etc.? – andersource Nov 24 '21 at 08:05
  • Just 1 new sample arrives every 10 minutes, and it has exactly the same structure as the training/test data: just four metrics (features) calculated from the new row and the 20 previous rows (weighted average). It's a good question regarding the model; I have tried RF and LR with very similar results. I could switch in NB and will give it a try (see the sketch after these comments). – brownie74 Nov 24 '21 at 09:11
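
A sketch of the basic-model comparison suggested in the comments, with GaussianNB as the simple baseline (variable names follow the earlier snippets; `X_live`/`y_live` are placeholders for live samples once their true outcomes are known):

```python
from sklearn.naive_bayes import GaussianNB

# a deliberately simple model: if even this one performs very differently on
# live data than on the held-out test set, the issue is more likely in the
# data (drift, leakage, pipeline mismatch) than in the choice of model
nb = GaussianNB().fit(X_train, y_train)
print("NB test accuracy:", nb.score(X_test, y_test))
print("NB live accuracy:", nb.score(X_live, y_live))
```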
0

To me it seems to be a data problem. You are splitting the data 70:30, but does all the data in the 70% training set come from before the data in the 30% test set?

It can be a problem if you mix older and newer data in the training set. If time is involved in the generation of the data, which seems to be the case since you have live data, the test set should never contain data that were generated before the training set's data.
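
One way to enforce this is a strictly chronological split, or a rolling-origin cross-validation (a sketch using scikit-learn's TimeSeriesSplit; it assumes `X` and `y` are already sorted by time and `model` is one of the classifiers being compared):

```python
from sklearn.model_selection import TimeSeriesSplit

# chronological 70:30 split: everything in the test set comes after the training data
cut = int(len(X) * 0.7)
X_train, X_test, y_train, y_test = X[:cut], X[cut:], y[:cut], y[cut:]

# rolling-origin cross-validation: each fold trains on the past and tests on the future
tscv = TimeSeriesSplit(n_splits=10)
for train_idx, test_idx in tscv.split(X):
    model.fit(X[train_idx], y[train_idx])
    print(model.score(X[test_idx], y[test_idx]))
```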

  • No, the data is definitely sequential, in time order. The data is timestamped according to when it originated, and I am 100% sure it is not being shuffled, even during cross-validation. – brownie74 Nov 24 '21 at 10:13