What metric should I use for a Regression model with a gamma distributed target?

Question

Background

I'm building a regression model on insurance data to predict the losses associated with a policy. I'm running an Optuna optimisation function to help me with this, but I'm struggling with what metric to use to score the model. The metric defines the course of the optimisation, so I want to get it right.

The model I'm currently using is LGBMRegressor, and the losses are approximately gamma distributed.

So far I've used R^2, but I've read that's not a great goodness-of-fit metric.

Question

What metric should I use to score my regression model?

The gamma distribution is a whole family with radically different shapes depending on the values of its two parameters. What shape are your losses? — Peter Flom, Feb 07 '24 at 12:02
It sounds like you are looking for a point prediction, not (say) a predictive density, correct? (This may be helpful.) If so, this thread may be what you are looking for. Incidentally, it is not necessarily the case that you should use the same evaluation metric in training as later on in evaluation, see here. — Stephan Kolassa, Feb 07 '24 at 12:19
@StephanKolassa yes, I'm looking for a point prediction, although a density would be interesting too. Thank you for the links! — Connor, Feb 07 '24 at 15:54

Georg M. Goerg · Accepted Answer · 2024-02-07T15:32:46.600

4

Use the negative (log-)likelihood of the distribution as your loss function. You can also turn it into a pseudo-R^2 for easier interpretability (for a given dataset, the negative log-likelihood and the pseudo-R^2 are in a 1:1 relation).
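
As a rough illustration of the point above, here is a minimal standard-library sketch of a gamma negative log-likelihood and a deviance-based pseudo-R^2. It assumes the model predicts the mean of a gamma distribution with a fixed, known shape, and it uses one common definition of pseudo-R^2 (one minus the ratio of model deviance to the deviance of a constant mean prediction). The function names and the sample numbers are made up for illustration and are not from any library.

```python
import math

def gamma_nll(y_true, y_pred, shape=1.0):
    """Mean negative log-likelihood of a gamma with mean y_pred and fixed shape."""
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        log_pdf = (shape * math.log(shape / mu)
                   + (shape - 1.0) * math.log(y)
                   - shape * y / mu
                   - math.lgamma(shape))
        total -= log_pdf
    return total / len(y_true)

def gamma_deviance(y_true, y_pred):
    """Mean gamma deviance: 2 * ((y - mu) / mu - log(y / mu)), averaged."""
    total = sum(2.0 * ((y - mu) / mu - math.log(y / mu))
                for y, mu in zip(y_true, y_pred))
    return total / len(y_true)

def pseudo_r2(y_true, y_pred):
    """Deviance-based pseudo-R^2 relative to predicting the sample mean."""
    baseline = sum(y_true) / len(y_true)
    null_deviance = gamma_deviance(y_true, [baseline] * len(y_true))
    return 1.0 - gamma_deviance(y_true, y_pred) / null_deviance

# Made-up strictly positive losses and predictions, for illustration only.
y = [120.0, 450.0, 80.0, 1000.0]
p = [150.0, 400.0, 100.0, 900.0]
score = pseudo_r2(y, p)
```

With the shape held fixed, the negative log-likelihood differs from the mean deviance only by a scale factor (shape / 2) and a term that does not depend on the predictions, so ranking models by either gives the same order. If scikit-learn is available, `sklearn.metrics.mean_gamma_deviance` and `sklearn.metrics.d2_tweedie_score` (with `power=2`) cover the same ground.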

I highly recommend xgboostlss / lightgbmlss for that regression task. https://github.com/StatMixedML/XGBoostLSS https://github.com/StatMixedML/LightGBMLSS

Somewhat related: working in the field of insurance, I have my doubts that policy losses are gamma distributed without zero-inflation (does every policy have a 100% chance of paying out more than $0?). If you actually do have zero payouts, I recommend using a zero-inflated lognormal distribution (and loss). This is a popular choice in customer lifetime value modeling, and it transfers very nicely to policy pricing. See here for details and an implementation (including the loss): https://github.com/google/lifetime_value

In the xgboostlss world this is the ZALN distribution.
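
To make the zero-inflated lognormal loss concrete, here is a minimal standard-library sketch of its per-observation negative log-likelihood: a Bernoulli term for whether the policy pays out at all, plus a lognormal term for the size of a positive payout. It is written from the textbook definition of the distribution, not copied from the linked repository, and the names `ziln_nll`, `ziln_mean`, `p_positive`, `mu` and `sigma` are hypothetical.

```python
import math

def ziln_nll(y, p_positive, mu, sigma):
    """Negative log-likelihood of one observation under a zero-inflated lognormal.

    p_positive: predicted probability that the payout is > 0
    mu, sigma:  lognormal parameters for the payout size, given it is > 0
    """
    if y == 0:
        # Only the "no payout" branch contributes.
        return -math.log(1.0 - p_positive)
    log_y = math.log(y)
    lognormal_nll = (log_y
                     + math.log(sigma)
                     + 0.5 * math.log(2.0 * math.pi)
                     + (log_y - mu) ** 2 / (2.0 * sigma ** 2))
    return -math.log(p_positive) + lognormal_nll

def ziln_mean(p_positive, mu, sigma):
    """Point prediction: P(payout > 0) times the lognormal mean."""
    return p_positive * math.exp(mu + 0.5 * sigma ** 2)
```

Averaging `ziln_nll` over a validation set gives a score to minimise, and `ziln_mean` turns the three predicted parameters into a single expected loss per policy.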

edited Feb 07 '24 at 15:32

answered Feb 07 '24 at 12:16

Georg M. Goerg


Thank you, those githubs look really interesting! Are these tools commonly used in the insurance industry? It looks like they allow you to flexibly create a distribution, is that correct? What do you mean by the "negative likelihood of your distribution"? Does that basically mean try and fit to a distribution and the loss is how poorly your distribution fits? – Connor Feb 07 '24 at 14:38

Yes, that's right. See e.g. eq. (3) and (4) in the LTV paper for the zero-inflated lognormal negative log-likelihood loss. – Georg M. Goerg Feb 07 '24 at 15:31
