Is there a number of data points we can settle on that would let our regression model predict with high accuracy? (The accuracy metrics I have in mind are RMSE and R-squared.) By high accuracy I mean something above 88% to 90% accuracy with a 95% confidence interval, though I am not tied to any particular number.

In my current setup we have to run some tests, but we cannot collect a very large dataset because each test takes a long time to run. We have concluded that we will run around 80 tests (based on the average time each test takes, plus the time needed to change the configuration, etc.).

However, I am not convinced that this many data points would be enough to give us an accurate model.

In addition, we run each test only once. My concern is that the outcome we observe may be a fluke. Statistically speaking, how many runs per test would rule out a fluke?

SJa
  • 534
  • What is your criterion of accuracy anyway? For example, if the criterion is to predict values exactly, most regression-type models fail absolutely. – Nick Cox Nov 23 '22 at 15:37
  • I aim to use RMSE and R-squared to measure accuracy. – SJa Nov 23 '22 at 15:40
  • OK, but RMSE 90% of SD(y) and R-square of 90% are still different criteria. – Nick Cox Nov 23 '22 at 15:55
  • RMSE is good, but I'm afraid my point still stands. $R^2$ as an evaluation measure has the problem that a systematically biased prediction can have a high $R^2$: if you always predict $\hat{y}=y-c$, i.e., underpredict by a fixed amount, you still get $R^2=1.0$. And the predictability of a numerical target is still very hard to assess because of residual variation. Without looking at your particular situation, it makes no sense to posit some specific target accuracy. – Stephan Kolassa Nov 23 '22 at 15:56
  • Two could work -- and millions might fail to be sufficient. At the very least you need to indicate how the prediction relates to the data, what the model is, how you measure accuracy, and what the standard errors are in your model. – whuber Nov 23 '22 at 19:03
  • Whether an accuracy of 88-92% is "reasonable" also depends very much on the situation. A temperature forecast five minutes ahead may reach a certain accuracy threshold, but a temperature forecast five days ahead will have a much harder time. – Stephan Kolassa Nov 23 '22 at 19:29
  • How well-behaved is your data? If it's extremely noisy, you might need thousands of data points; if it's smooth, you could get away with two. – Mark Nov 24 '22 at 00:58
  • R^2 is a measure of noise, mostly. Accordingly, if your goal is to get a high R^2 for whatever unholy reason (please don't do that), you can estimate how the noise in your signal behaves and collect data accordingly. If the model is a poor fit for the data overall, it won't help, and if you insist on pushing R^2 higher, overfitting is all you will get. Seriously, stop chasing metrics senselessly and figure out what actual problem you are trying to solve. Stephan links an excellent community wiki in their answer: some problems are outright ill-defined and/or hopeless. – Lodinn Nov 24 '22 at 08:42
  • This question is absolutely impossible to answer – Firebug Nov 24 '22 at 11:44
  • Can you provide some sample data? – AccidentalTaylorExpansion Nov 24 '22 at 12:09
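
To make the point in the comments about $R^2$ and biased predictions concrete, here is a minimal sketch. The data are simulated and every number in it (80 points, mean 50, SD 10, offset c = 5) is an assumption chosen for illustration, not taken from the question. It computes RMSE together with two common definitions of $R^2$: the squared correlation between predictions and actuals, and the coefficient of determination 1 - SSE/SST. The claim that a constant underprediction still gives $R^2 = 1.0$ holds for the squared-correlation version; the 1 - SSE/SST version does drop, and RMSE equals the offset.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def rmse(y, y_hat):
    """Root mean squared error between actuals and predictions."""
    return math.sqrt(mean([(a - b) ** 2 for a, b in zip(y, y_hat)]))


def r2_squared_correlation(y, y_hat):
    """R^2 defined as the squared Pearson correlation of y and y_hat."""
    my, mh = mean(y), mean(y_hat)
    cov = sum((a - my) * (b - mh) for a, b in zip(y, y_hat))
    var_y = sum((a - my) ** 2 for a in y)
    var_h = sum((b - mh) ** 2 for b in y_hat)
    return cov ** 2 / (var_y * var_h)


def r2_one_minus_sse_over_sst(y, y_hat):
    """R^2 defined as 1 - SSE/SST (coefficient of determination)."""
    my = mean(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sst = sum((a - my) ** 2 for a in y)
    return 1.0 - sse / sst


# Simulated "true" outcomes; all parameters here are illustrative assumptions.
rng = random.Random(0)
y = [rng.gauss(50.0, 10.0) for _ in range(80)]

# A systematically biased prediction: always underpredict by a fixed amount c.
c = 5.0
y_hat = [value - c for value in y]

print("RMSE:                      ", rmse(y, y_hat))
print("R^2 (squared correlation): ", r2_squared_correlation(y, y_hat))
print("R^2 (1 - SSE/SST):         ", r2_one_minus_sse_over_sst(y, y_hat))
```

Which definition a given library reports matters here, so it is worth checking before quoting an $R^2$ target.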

2 Answers


We can't tell you. It depends on your situation and how easy prediction is in your situation.

How many coin tosses do you need to observe before you can predict the next one with 90% accuracy?
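
The coin-toss question can be checked with a small simulation. This is only a sketch under stated assumptions: the coin is fair, the "model" is a hypothetical majority-vote predictor (`next_toss_accuracy` is a name made up for this illustration), and the trial counts are arbitrary. The hit rate should hover around 50% no matter how many earlier tosses are observed, because the target itself is unpredictable.

```python
import random


def next_toss_accuracy(n_observed, n_trials=5000, seed=0):
    """Hit rate when predicting the next fair toss by majority vote.

    In each trial we observe n_observed fair tosses, predict the more
    frequent side (ties go to heads), and compare with one new toss.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        heads = sum(rng.random() < 0.5 for _ in range(n_observed))
        guess_heads = 2 * heads >= n_observed
        actual_heads = rng.random() < 0.5
        hits += guess_heads == actual_heads
    return hits / n_trials


# More observed data does not move the accuracy away from chance level.
for n in (0, 10, 100, 500):
    print(n, next_toss_accuracy(n))
```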

Related: How to know that your machine learning problem is hopeless?

And of course, in many situations you can predict with "better than 90% accuracy" without learning at all, namely when one outcome occurs in more than 90% of cases - then just always predict that. For instance: always classify a credit card transaction as non-fraudulent. Most CC transactions are non-fraudulent, so such a useless prediction will look very good in terms of accuracy... because accuracy is not a good evaluation measure.
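
A minimal sketch of that fraud example, using simulated labels. The 1% fraud rate and the sample size are assumptions made up for illustration, not figures from any real dataset. The "classifier" never learns anything and always predicts non-fraudulent, yet its accuracy is far above 90% while it catches no fraud at all.

```python
import random

rng = random.Random(0)

# Assumed fraud rate for illustration only; True marks a fraudulent transaction.
fraud_rate = 0.01
labels = [rng.random() < fraud_rate for _ in range(100_000)]

# The "useless" prediction: always classify as non-fraudulent.
predictions = [False] * len(labels)

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Recall on the fraud class: share of actual frauds that were flagged.
n_fraud = sum(labels)
n_caught = sum(p and t for p, t in zip(predictions, labels))
recall = n_caught / n_fraud if n_fraud else float("nan")

print("accuracy:", accuracy)
print("fraud recall:", recall)
```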

Stephan Kolassa
  • 123,354

I can do this with zero data points. My estimate is 42. Say the true distribution is around 0 with a deviation of 10 (but it can be different as well). Given that, if you define the accuracy as 90%, then this estimate satisfies the condition.

[Figure: example]