Adding rule on top of ML to manipulate RMSE

Question

I am currently working on a regression problem to predict customer revenue in the next 3 or 6 months etc.

I am currently trying different regression approaches. However, some of the predicted y values are negative, even though a monetary value cannot be negative.

My R2 value is 90% on the train set and 82% on the test set. However, when I look at the minimum of the predicted y values, it is negative. My RMSE is around 60K on train and 65K on test. The actual revenue values range from 9.8 to 6980753.

So, do you think it is appropriate to introduce a rule (on top of my ML output) that replaces all negative predicted monetary values with zero?

Something like the following in Python:

y_pred = np.where(y_pred < 0, 0, y_pred)  # assumes numpy is imported as np

Would this mean I am tampering with the model result, and is it unethical to do this? Negative values don't make sense anyway, and replacing them with zeroes would at least improve performance metrics like RMSE or R2. I don't intend to do this, but would like to know what the appropriate and right thing to do is.

Why are negative values impossible? Predicting a negative value seems like a prediction of losing money. Businesses tend to dislike when that happens, so being able to predict such events seems valuable. — Dave, Nov 20 '22 at 14:39
@Dave they're talking about revenue, which is the total amount of money coming into the company. Profit (revenue-expenses) can clearly be negative; not so clear with revenue. — John Madden, Nov 20 '22 at 15:01
My two cents: there are fancier approaches that can ensure positivity of the prediction (some of them aren't all that complicated: like just taking the log of revenue), but the approach of thresholding negative predictions to zero is generally fine, and certainly not unethical. — John Madden, Nov 20 '22 at 15:03
Revenue can certainly be negative, e.g., in a retail environment if someone returns something for a refund. I second Dave's recommendation to use an appropriate distributional family. As Dave also notes, simply log-transforming the dependent variable will yield biased predictions if we do the naive back-transformation, see here for bias adjustments. That said, of course we can post-process predictions to truncate at zero. — Stephan Kolassa, Nov 20 '22 at 15:41
@StephanKolassa that's a really interesting scenario, do you have an accounting source which indicates that refunds should be viewed as negative revenue rather than an expense? Regardless, I sense we may be getting off track from the OP's perspective... EDIT: Seems like they exist, but are rather rare: https://blog.phoenixcontact.com/marketing-sea/2010/09/negative-revenue-financial-reporting/ https://www.quora.com/Why-do-some-companies-have-negative-revenues-Do-they-fix-a-price-lower-than-the-cost-of-making-knowing-that-it-will-be-a-loss — John Madden, Nov 20 '22 at 16:10
@JohnMadden: to be honest, I am less interested in how they are treated from an accounting perspective, and more in seeing negative observations in the sales time series I have to forecast... My favorite was a series just a while ago, Christmas tree stands, lots of positive sales in December, and one negative sale, obviously a return, in mid-January. Of course, the forecast should still be positive. But bottle returns are also a kind of refund, and whether these are treated as positive or negative is down to the IT integration, and there it may make sense to have negative forecasts. — Stephan Kolassa, Nov 20 '22 at 16:20
@JohnMadden: here is a nice example I presented just last Friday, with one instance of negative 500 unit sales of roof tiles, see the top time series. Someone returned 500 roof tiles. As you see, I had unit sales of this SKU in four different stores, and three of them exhibited negative sales at some point in time. Again, how this is treated from an accounting point of view is less relevant for me and possibly OP than how to deal with it in forecasting. — Stephan Kolassa, Nov 20 '22 at 16:26
@StephanKolassa nice presentation Stephan, thanks for sharing! I loved your reaction to covid mask comment :) what a fun dataset. — John Madden, Nov 20 '22 at 17:21
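
The log-transform route raised in the comments, and the bias of the naive back-transformation, can be illustrated with a small sketch. This is not taken from the thread: the data are synthetic, all variable names are hypothetical, and the adjustment used here is Duan's smearing estimator, which is one of several possible bias corrections.

```python
import numpy as np

# Synthetic, strictly positive "revenue" with log-normal noise (assumption for illustration).
rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(0.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])        # intercept + one feature
sigma = 0.8
y = np.exp(7.0 + 1.0 * x + rng.normal(0.0, sigma, size=n))

# Ordinary least squares on log(y).
beta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
fitted_log = X @ beta
resid = np.log(y) - fitted_log

# Naive back-transformation: always positive, but it targets the
# conditional median rather than the conditional mean.
naive = np.exp(fitted_log)

# Duan's smearing factor: the average of the exponentiated residuals.
smear = np.mean(np.exp(resid))
corrected = naive * smear

print("mean of y:         ", y.mean())
print("mean of naive:     ", naive.mean())
print("mean of corrected: ", corrected.mean())
```

Both sets of predictions are positive by construction; the corrected ones are scaled up so that their average is closer to the average observed revenue.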

score 1 · Accepted Answer · answered Jan 04 '23 at 10:58

The "correct" way to ensure nonnegative predictions is to use an appropriate distributional family. You could use a count data family, like negative binomial regression (see Hilbe's textbook), since revenue is in principle discrete... but the numbers are typically so high that you could also use a continuous distribution, like a gamma regression or similar. If there is a high incidence of zeros, you could look at zero-inflated models.
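
As a rough sketch of what a gamma regression does, the following fits a gamma GLM with a log link by iteratively reweighted least squares using only NumPy. The data are synthetic and the function name `fit_gamma_log_link` is hypothetical; in practice a library implementation (for example a GLM routine in statsmodels or scikit-learn) would usually be preferable. The log link is an assumption made here because it guarantees positive predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): gamma-distributed revenue with a log-linear mean.
n = 2000
x = rng.uniform(0.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])        # intercept + one feature
beta_true = np.array([8.0, 1.5])
shape = 2.0
mu_true = np.exp(X @ beta_true)
y = rng.gamma(shape, mu_true / shape)       # mean mu_true, strictly positive

def fit_gamma_log_link(X, y, n_iter=50, tol=1e-10):
    """Fisher scoring for a gamma GLM with log link.

    With this link the IRLS weights are constant, so each step is an
    unweighted least-squares fit to the working response z.
    """
    beta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]  # starting values
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu             # working response
        beta_new = np.linalg.lstsq(X, z, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

beta_hat = fit_gamma_log_link(X, y)

# Predictions are exp(linear predictor), so they cannot be negative,
# even for feature values far outside the training range.
X_new = np.column_stack([np.ones(3), np.array([-5.0, 0.0, 3.0])])
y_hat = np.exp(X_new @ beta_hat)
print(beta_hat, y_hat)
```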

Alternatively, a far simpler method that may be quite sufficient for your application is indeed to truncate your predictions at zero. You would need to weigh whether the added complexity of a "better" model is justified by the improvement it brings over simple truncation.
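
A minimal sketch of the truncation rule is shown below. The numbers are made up for illustration and `rmse` is a hypothetical helper. Provided every actual value is nonnegative, moving a negative prediction up to zero can only bring it closer to the actual value, so the RMSE cannot get worse.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical actuals (all nonnegative) and raw model predictions.
y_true = np.array([9.8, 1200.0, 45000.0, 310000.0, 0.0])
y_pred = np.array([-5000.0, 3000.0, 41000.0, 298000.0, -150.0])

# Truncate at zero; nonnegative predictions are left unchanged.
y_clipped = np.clip(y_pred, 0, None)

rmse_raw = rmse(y_true, y_pred)
rmse_clipped = rmse(y_true, y_clipped)
print(rmse_raw, rmse_clipped)
```

If the target can legitimately be negative (returns or refunds, as discussed in the comments), this guarantee no longer holds and truncation may hurt.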
