
I have modelled count data using some covariates in a generalized linear model with a Poisson family (log link). I have then predicted expected counts for an external dataset that was not used to train the model, but for which the values of the covariates are available.

This external dataset also contains the observed counts, so I can perform a kind of cross-validation and compare my model's predicted values with the observed ones. Since the model is not linear, I cannot use an ordinary R2 to compare observed vs. predicted values.

What would be a proper metric to use in this context to get information similar to what the R2 gives for linear models?

nd091680
  • 113
  • Welcome to Cross Validated! You might choose to look at a different statistic, but why can’t you look at $R^2?$ The comment about nonlinearity does not make sense to me. // $R^2$ is funky once you depart from least squares linear regression. What would be your calculation of $R^2?$ I give my preferred calculation here, which is equivalent to other common definitions in the simple least squares linear regression case and makes a lot of sense to me as we generalize. – Dave Dec 04 '22 at 19:34
  • Sorry Dave, I didn't see your comment in full, and I replied just to the first part. In the answer you pointed out, do you mean that this second definition of R² is equivalent to the first one and can be generalized to other models? Thanks! – nd091680 Dec 04 '22 at 19:39
  • It’s just an equation. Stick the numbers in the formula and get a result. – Dave Dec 04 '22 at 19:47
  • Thanks. Just to clarify: when I use that equation I calculate the squared RMSE between the predicted and observed values. In the denominator, I have to use the variance of the observed values, is that correct? – nd091680 Dec 04 '22 at 19:51
  • That equation is completely literal. In the numerator, calculate the residuals, square them, and add them. In the denominator, calculate the differences between the true values and the overall mean of $y$, square those differences, and sum them. Then do the division and subtract that quotient from $1$. // However, this calculation is not really the subject of your question. What are you looking for in an evaluation metric? That will guide your choice of evaluation metric. If you’re interested in squared residuals, use squared residuals. Otherwise, don’t use them; use something else. – Dave Dec 04 '22 at 19:55
  • Ok, I got it. What I am really interested in is understanding how good my model is at making predictions, that is, how well my Poisson model predicts values that are close to the observed ones. Do you know which metric I should use in this particular case? – nd091680 Dec 04 '22 at 20:01
  • There are tons of metrics capable of this. If you have a sense of what you like about metrics used in other situations, it might be more productive for you to give some consideration to those and post your thoughts, rather than having me list a few metrics that make sense to me. – Dave Dec 04 '22 at 20:04
  • Thanks again for your answer. I would be really interested in knowing which metric you would prefer to use in such a context, because I am really new to this field and I'd appreciate help being pointed in the right direction – nd091680 Dec 04 '22 at 20:49

1 Answer


Let there be $n$ observations and $p$ features (so $\beta_0$ is the intercept). Thus, your linear predictor for observation $i$ is $p_i = \hat\beta_0 + \hat\beta_1 x_{i, 1} +\cdots + \hat\beta_p x_{i, p}$.

A natural choice is to use the optimization criterion for calculating the regression parameters. Like other generalized linear models, Poisson regression uses maximum likelihood estimation of the parameters, meaning that the estimate $\hat\beta$ is the one that maximizes $ \underset{i=1}{\overset{n}{\sum}}\left( y_ip_i - e^{ p_i} \right) $. (You can refer to the Wikipedia derivation for the details.)
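As a concrete illustration, here is a minimal sketch of that objective in Python. The simulated data, variable names, and "true" coefficients are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one covariate, counts drawn from a Poisson model
# with (made-up) true coefficients beta0 = 0.5, beta1 = 0.8.
x = rng.uniform(0, 2, size=200)
y = rng.poisson(np.exp(0.5 + 0.8 * x))

def poisson_objective(beta0, beta1, x, y):
    """Sum over i of (y_i * p_i - exp(p_i)), where p_i is the linear
    predictor. Maximizing this is equivalent to maximizing the Poisson
    log-likelihood, since the dropped -log(y_i!) term does not depend
    on the parameters."""
    p = beta0 + beta1 * x
    return np.sum(y * p - np.exp(p))

# The objective is higher near the true parameters than far from them.
near = poisson_objective(0.5, 0.8, x, y)
far = poisson_objective(2.0, -1.0, x, y)
print(near > far)  # True
```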

This is (related to, see the derivation) the log-likelihood of the Poisson distribution.

Consequently, it is totally reasonable to talk about how high of a likelihood your model has.

However, there is no context for what constitutes a good or even tolerable likelihood. $R^2$-style metrics that are limited to $(0, 1)$ or $(-\infty, 1)$ go some way toward providing that context. The usual $R^2$ is a comparison of the square loss achieved by your model to that of a naïve model that always predicts $\bar y$. This makes some sense: since the regression aims to predict the conditional expected value, what better baseline model to beat than one that always predicts the overall mean $\bar y?$

$$ R^2 = 1-\dfrac{ \underset{i=1}{\overset{n}{\sum}}\left( y_i -\hat y_i \right)^2 }{ \underset{i=1}{\overset{n}{\sum}}\left( y_i -\bar y \right)^2 }\\= 1-\dfrac{ \text{Square loss of your model} }{ \text{Square loss of the baseline model} } $$
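The formula translates directly into code. A minimal sketch (the helper name `r_squared` and the toy data are made up for illustration):

```python
import numpy as np

def r_squared(y, y_hat):
    """1 - (sum of squared residuals) / (sum of squared deviations
    of y from its overall mean)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

y = [3, 1, 4, 1, 5]
print(r_squared(y, y))                 # perfect predictions -> 1.0
print(r_squared(y, [np.mean(y)] * 5))  # baseline model -> 0.0
```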

A way to view the fraction is as a comparison between the sum of squared residuals achieved by your model and the sum of squared residuals achieved by the baseline model. This sum of squared residuals is the loss function, often called "square loss".

In Poisson regression, we do not aim to minimize the square loss. We aim to maximize the Poisson log-likelihood, which is equivalent to minimizing the negative Poisson log-likelihood, $ \left(-\underset{i=1}{\overset{n}{\sum}}\left( y_ip_i - e^{ p_i} \right)\right) $.

Consequently, I suggest applying the same idea. You have a model that achieves some Poisson loss. You can find the Poisson loss of a model that always predicts the overall mean $\bar y$ (whose linear predictor is therefore $\log\bar y$, since the link is the log) by calculating $\left(-\underset{i=1}{\overset{n}{\sum}}\left( y_i\log\bar y - \bar y \right)\right) $. Put the two in an $R^2$-style equation.

$$ R^2 = 1-\dfrac{ \left(-\underset{i=1}{\overset{n}{\sum}}\left( y_ip_i - e^{ p_i} \right)\right) }{ \left(-\underset{i=1}{\overset{n}{\sum}}\left( y_i\log\bar y - \bar y \right)\right) }\\= 1-\dfrac{ \text{Poisson loss of your model} }{ \text{Poisson loss of the baseline model} } $$
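A minimal sketch of this statistic, written in terms of the fitted means $\hat\mu_i = e^{p_i}$ rather than the linear predictors, with hypothetical counts and fitted values. The constant $\log(y_i!)$, which estimation drops, is added back here so that both losses are full negative log-likelihoods:

```python
import math
import numpy as np

def poisson_pseudo_r2(y, mu_hat):
    """McFadden-style pseudo-R^2: one minus the ratio of the model's
    negative Poisson log-likelihood to that of a baseline that always
    predicts ybar. The constant sum of log(y_i!) cancels in parameter
    estimation but is included so each loss is a full negative
    log-likelihood."""
    y, mu_hat = np.asarray(y, dtype=float), np.asarray(mu_hat, dtype=float)
    const = sum(math.lgamma(yi + 1) for yi in y)  # sum of log(y_i!)
    model_loss = -(np.sum(y * np.log(mu_hat) - mu_hat) - const)
    ybar = y.mean()
    baseline_loss = -(np.sum(y * np.log(ybar) - ybar) - const)
    return 1 - model_loss / baseline_loss

y      = np.array([2, 0, 3, 1, 5, 4])              # hypothetical counts
mu_hat = np.array([1.8, 0.6, 2.9, 1.2, 4.6, 3.9])  # hypothetical fits
print(poisson_pseudo_r2(y, mu_hat))
```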

In fact, this is how McFadden's pseudo $R^2$ works for binomial models (e.g., logistic regression), but with binomial likelihoods instead of Poisson ones, so there is precedent for extending the conventional $R^2$ in this way.

An advantage of this approach is that it arises as a natural generalization of a popular and well-understood technique, $R^2$. Drawbacks include its relative obscurity (harder to explain to bosses/customers) and perhaps a lack of software implementations.

One property of the usual $R^2$ that the statistic above, annoyingly, does not satisfy is equaling $1$ when predictions are perfect. For that reason, we might subtract the loss achieved by perfect predictions from both the numerator and the denominator. Subtracting a constant from the loss does not change which parameter values minimize it, so the shifted loss is equivalent for estimation and remains a valid loss function and a valid quantity to use in the numerator (for your model) and the denominator (for the baseline model).
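A sketch of that adjustment, assuming the Poisson loss above and a saturated model that predicts each $y_i$ exactly; all data here are hypothetical. The additive constants cancel in the differences, so dropping $\log(y_i!)$ is harmless:

```python
import numpy as np

def xlogy(a, b):
    # a * log(b) elementwise, with the convention 0 * log(0) = 0
    out = np.zeros_like(a, dtype=float)
    mask = a != 0
    out[mask] = a[mask] * np.log(b[mask])
    return out

def poisson_loss(y, mu):
    # Negative Poisson log-likelihood, up to an additive constant in y
    return -np.sum(xlogy(y, mu) - mu)

def poisson_d2(y, mu_hat):
    """R^2-style statistic with the loss of a perfect (saturated) model,
    which predicts each y_i exactly, subtracted from both numerator and
    denominator; perfect predictions now score exactly 1."""
    y, mu_hat = np.asarray(y, dtype=float), np.asarray(mu_hat, dtype=float)
    model    = poisson_loss(y, mu_hat)
    baseline = poisson_loss(y, np.full_like(y, y.mean()))
    perfect  = poisson_loss(y, y)
    return 1 - (model - perfect) / (baseline - perfect)

y      = np.array([2, 0, 3, 1, 5, 4])              # hypothetical counts
mu_hat = np.array([1.8, 0.6, 2.9, 1.2, 4.6, 3.9])  # hypothetical fits
print(poisson_d2(y, y))        # perfect predictions -> 1.0
print(poisson_d2(y, mu_hat))
```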

On the other hand, there are ways of deriving an $R^2$-style metric from deviance statistics that are more general than the familiar sums of squares in the usual $R^2$. There is appeal to this, and you might like the idea of reporting the "proportion of deviance explained". Disadvantages include the fact that people are less likely to have a technical understanding of deviance than they (at least think they) have of variance, and the lack of a comparison to a baseline model (which I find extremely intuitive). A further disadvantage is that I suspect Ben's linked explanation relies on the linear aspect of the GLM and on maximum likelihood estimation of the parameters. Even if his deviance $R^2$ behaves like the regular $R^2$ in the linear case, the "proportion of deviance explained" interpretation may be invalid outside it, much as "proportion of variance explained" is only valid for $R^2$ under particular conditions, such as linear models estimated via least squares. Out-of-sample testing could be funky, too, and I have a strong opinion about $R^2$ and $R^2$-style statistics when they are applied to data on which the model was not trained.

However, depending on what you value, you might be interested in just the usual $R^2$. If you want to measure how you do in terms of squared residuals, the usual $R^2$ could be an excellent option. Indeed, getting back to binomial models, the usual $R^2$ is a viable metric. Your reference does list some drawbacks of the usual $R^2$, yes, but they might be okay. First, $R^2$ being interpreted as the proportion of variance explained is the exception, not the rule, and you don't even need nonlinearity to wreck that interpretation, depending on how you estimate the coefficients (1)(2). Second, while it is true that the usual $R^2$ is not bounded below by $0$ when you estimate the coefficients through a method other than least squares (Poisson regression minimizes Poisson loss, not square loss), a value below $0$ signals to you that your model is outperformed in terms of square loss by the naïve baseline model that always predicts $\bar y$. It strikes me as a feature, not a bug, for a statistic to flag cases where your performance is poor.
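A small illustration of that flagging behavior, with made-up numbers: predictions that do systematically worse than the mean-only baseline yield a negative $R^2$:

```python
import numpy as np

y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])            # hypothetical counts
bad_preds = np.array([10.0, 8.0, 9.0, 7.0, 11.0])  # hypothetical, badly biased

ss_res = np.sum((y - bad_preds) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(r2 < 0)  # True: the naive mean-only baseline beats these predictions
```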

For all of these, unfortunately, there is no objective measure of what constitutes a good model. While it is true that a value less than zero indicates performance worse than your baseline, which is reasonably considered a "must beat" level of performance, how high constitutes good performance depends on the problem at hand.

Hopefully this discussion will guide you in choosing one of these metrics, or an alternative should none of them suit your needs.

Dave
  • 62,186
  • You've tagged the question with [tag:cross-validation]. If you want to explore out-of-sample assessments of performance like cross validation does, that perhaps warrants its own question. // I have a strong opinion about out-of-sample $R^2$-style statistics and would apply similar logic to an $R^2$-style statistic based on Poisson loss. – Dave Dec 06 '22 at 17:03