
I have been searching for an estimator that can find patterns in data with two columns. Below is an example of the data, to explain exactly what I am working with:

A B
1 0.22
1 0.30
1 0.11
2 0.07
2 0.50
2 0.90
3 0.89
3 0.01
3 0.99
4 0.76
4 0.56
4 0.45

So the data looks something like this, and I want to get a prediction for column B when I give 5 as input for column A, and similarly for future values of column A. I have tried RandomForestRegressor() from scikit-learn, but it hasn't been performing very well (high MAE and MSE).
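For reference, the attempt could be sketched like this (a minimal sketch using only the example rows above; the real dataset and hyperparameters may differ):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Example rows from the question: column A and column B
A = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]).reshape(-1, 1)
B = np.array([0.22, 0.30, 0.11, 0.07, 0.50, 0.90,
              0.89, 0.01, 0.99, 0.76, 0.56, 0.45])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(A, B)

# In-sample errors (the question reports these being high on the full data)
pred = model.predict(A)
print("MAE:", mean_absolute_error(B, pred))
print("MSE:", mean_squared_error(B, pred))
print("Prediction for A=5:", model.predict([[5]])[0])
```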

A plot of the data I am working on is attached.

harthik
  • If you upload a plot of all your data, people can give you a clearer suggestion. – seanv507 Jun 13 '23 at 10:10
  • Hey @seanv507, just attached a plot. – harthik Jun 13 '23 at 10:41
  • That plot's usefulness is limited. It has two major problems. First, it doesn't reliably convey the data because it looks like many of the points could be overplotted. Second, it appears to represent derived data computed perhaps by dividing failure counts by attempts -- and that eliminates crucial information about the varying numbers of attempts. Any successful attempt to find patterns must overcome these inherent problems. – whuber Jun 13 '23 at 16:45
  • @whuber I derived fail_rate from pass counts and fail counts. I do still have both columns, but I didn't think they would be useful for prediction when only a sprint_name is given as input. – harthik Jun 13 '23 at 17:08

1 Answer


If you plot your data, it looks noisy but fairly linear.

The data from the question plotted as a scatter plot with a regression line.

Why not use linear regression for it? You have very little data (unless you have more than you've shown us), so a simple model like linear regression sounds like a good bet. Your predicted column B seems to range between 0 and 1; if those are strict bounds, you could use logistic regression or beta regression, but depending on how you intend to use the model, this may not be necessary. If you have only one feature, using a more complicated model seems like overkill.
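For instance, a plain least-squares fit on the example rows from the question would look like this (a minimal sketch; the variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Example rows from the question
A = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]).reshape(-1, 1)
B = np.array([0.22, 0.30, 0.11, 0.07, 0.50, 0.90,
              0.89, 0.01, 0.99, 0.76, 0.56, 0.45])

lin = LinearRegression().fit(A, B)
print("slope:", lin.coef_[0])        # 0.128 on these rows
print("intercept:", lin.intercept_)  # 0.16
print("prediction for A=5:", lin.predict([[5]])[0])  # 0.8
```

If column B must stay within [0, 1], the same one-line workflow carries over to a logistic or beta regression; the linear fit is just the simplest starting point.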

Moreover, from what you are saying, you want to extrapolate beyond the data: your column A ranges from 1 to 4 and you want a prediction when it equals 5. Models like random forest can't really extrapolate, so this is another reason not to use one here. Extrapolation is hard in general; with a simple model you can at least understand what it is doing and better anticipate its shortcomings.

Random forest is good at figuring out interactions between features; if you have only one feature, the regression trees it uses will behave like piecewise-constant regression (and stay constant beyond the edges of the training range). Averaging many such trees leads to a smoother regression line, but again, not much better than something simple like local regression. There are many better alternatives when you have a single feature.
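To make the two points above concrete, here is a small sketch (again using the example rows from the question) showing that a random forest returns the same value for every input past the training range, while linear regression continues the trend:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Example rows from the question
A = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]).reshape(-1, 1)
B = np.array([0.22, 0.30, 0.11, 0.07, 0.50, 0.90,
              0.89, 0.01, 0.99, 0.76, 0.56, 0.45])

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(A, B)
lin = LinearRegression().fit(A, B)

for a in (5, 10, 50):
    # Every tree split lies inside [1, 4], so any input above 4 lands in
    # the same leaves -- the forest's prediction is flat past the edge,
    # whereas the linear fit keeps following its slope.
    print(a, rf.predict([[a]])[0], lin.predict([[a]])[0])
```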

So if you have little data, only one feature, and want to extrapolate, linear regression seems to be a good choice.

Tim
  • I do have more data (roughly 1000 rows), and col A is not in the range 1–4; it goes more like 1–50, but col B is bound to [0, 1]. Linear regression sounds good, but would it work for more data? – harthik Jun 13 '23 at 09:34
  • @HARTHIKMALLICHETTY More data is not a problem; I assumed you had little data, in which case a simple model would be preferable. The other two points still apply: you have one feature and want to extrapolate, so something like linear regression seems to be a good choice. If that doesn't work, as I mentioned, you could try other regression models, but for the reasons above it's better to start simple. – Tim Jun 13 '23 at 09:36
  • Hi, I just added a graph of how my data looks to the question. In my opinion, linear regression would not fit that graph well and may have high residuals. – harthik Jun 13 '23 at 10:32
  • @harthik Your plot definitely does not show 1000 points, so I guess they are overplotted. In that case, it is hard to say what kind of relationship there is between the points. – Tim Jun 13 '23 at 12:45
  • Okay, but if I take the plot at face value, i.e. assume there is no overplotting, then how could I deal with it? – harthik Jun 13 '23 at 17:10
  • @harthik If there's no overplotting, then it seems the prediction would be something like a flat line around 0.1; you don't even need linear regression. – Tim Jun 13 '23 at 17:24