
I have data made of vectors $\textbf{x}$ and $\textbf{y}$. I want to predict $\textbf{y}$ from $\textbf{x}$ with a set of parameters $a_1, a_2, a_3$, fitted with a linear and a non-linear model using the curve_fit function from scipy.optimize:

Linear model (in $\textbf{x}$):
\begin{equation}\label{equation_linear_model} \hat{\textbf{y}}_1 = \text{F}_1(\textbf{x},a_1) = \textbf{x} \cdot a_1 \end{equation}

Non-linear model (in $\textbf{x}$):
\begin{equation}\label{equation_non_linear_model} \hat{\textbf{y}}_2 = \text{F}_2(\textbf{x},a_2,a_3) = a_2 \cdot \Big(\frac{1}{e^{-\textbf{x}\cdot a_3}+1}\Big) \end{equation}

$\hat{\textbf{y}}_1$ and $\hat{\textbf{y}}_2$ are the predictions of the respective models, which should be as close to $\textbf{y}$ as possible. I then want to compare the performance of the two models by calculating the AIC. So far I have used the definition of the AIC for linear models (for both models, including the non-linear one):

$$ AIC = n \log(\hat{\sigma}^2) + 2k $$

where

$$ \hat{\sigma}^2 = \frac{\sum_i \hat{\epsilon}_i^2}{n}, $$

$n$ is the number of data points, $k$ is the number of parameters, and $\hat{\epsilon}_i$ is the prediction error for data point $i$.

What would be the formula to compute the two AIC values (one for the linear and one for the non-linear model), and a corresponding p-value for a significant difference between the two AIC values?
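To make the setup concrete, here is a minimal sketch of fitting both models with curve_fit and computing the AIC from the residuals; the simulated data, seed, and true parameter values are made up purely for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

# Linear model F1: y_hat = x * a1
def f1(x, a1):
    return x * a1

# Non-linear model F2: y_hat = a2 / (exp(-x * a3) + 1)
def f2(x, a2, a3):
    return a2 / (np.exp(-x * a3) + 1)

def aic(y, y_hat, k):
    """AIC = n * log(sigma_hat^2) + 2k, with sigma_hat^2 the mean squared residual."""
    n = len(y)
    sigma2_hat = np.sum((y - y_hat) ** 2) / n
    return n * np.log(sigma2_hat) + 2 * k

# Simulated data, for illustration only
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 2 / (np.exp(-1.5 * x) + 1) + rng.normal(0, 0.1, size=x.size)

popt1, _ = curve_fit(f1, x, y)
popt2, _ = curve_fit(f2, x, y)

aic1 = aic(y, f1(x, *popt1), k=1)  # one fitted parameter: a1
aic2 = aic(y, f2(x, *popt2), k=2)  # two fitted parameters: a2, a3
print(aic1, aic2)
```

Whether $k$ should also count the estimated $\hat\sigma$ is a matter of convention; counting it adds 1 to both models and leaves the AIC difference unchanged.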

ecjb
  • I do not see an $x$ in your linear model (which isn't even linear as you've written it). Did you mean $x$ instead of $Q?$ // Are you predicting two different $y$ quantities, or are $y_1$ and $y_2$ the same? – Dave Jun 27 '22 at 14:34
  • I am sorry for the typo. Indeed $Q$ should be $x$. I corrected it – ecjb Jun 27 '22 at 14:40
  • Now are you predicting two different $y$ quantities, or is $y_1$ the same as $y_2?$ – Dave Jun 27 '22 at 14:46
  • @Dave. I changed the notation and added the following in the question: "$\hat{\textbf{y}}_1$ and $\hat{\textbf{y}}_2$ are the predictions of the respective models, which should be as close to $\textbf{y}$ as possible." – ecjb Jun 27 '22 at 15:18
  • @Dave Is the question clearer now, or are there still some points which need to be addressed to make it understandable? – ecjb Jun 28 '22 at 15:49
  • Yes, I think the question makes sense. – Dave Jun 28 '22 at 15:50
  • @Dave Ok, very good, thank you for the feedback. I would of course be glad for an answer or any hints – ecjb Jun 28 '22 at 15:57
  • What is the question? – Sextus Empiricus Jul 12 '22 at 08:12
  • @SextusEmpiricus I just wrote out the question clearly at the end – ecjb Jul 12 '22 at 09:06
  • @ecjb That's better. Still, could you clarify the specific use of words. What do you mean by 'How is it possible'? Why do you believe it would not be possible? – Sextus Empiricus Jul 12 '22 at 09:08
  • Regarding the question about p-values, comparing AIC is not about computing a p-value and testing a hypothesis. It is about making a selection among different models. See also Why can't we use AIC and p-value variable selection within the same model building exercise? – Sextus Empiricus Jul 12 '22 at 09:10
  • Thank you for your comment @SextusEmpiricus. I replaced "how is it possible" with "what would be the formula". I am going to read the link you provided on the p-value – ecjb Jul 12 '22 at 14:46
  • You mean what is the formula for $\epsilon$? – Sextus Empiricus Jul 12 '22 at 15:23
  • AIC doesn't require a linear model; if you look at the formula you've written out, there's nothing there that refers to the structure of the model itself. – jbowman Jul 12 '22 at 15:57
  • I notice now the word 'linear' in your phrase "I used so far the definition of AIC for *linear* models as follows". So your question is how to apply it to the non-linear model? – Sextus Empiricus Jul 12 '22 at 16:20
  • @SextusEmpiricus yes exactly! – ecjb Jul 13 '22 at 06:18
  • Regarding comparing the two models by a test, "using the AIC" basically means running a log likelihood ratio test (if numbers of parameters are the same, all else in the AIC goes away). For this, see here: https://stats.stackexchange.com/questions/110038/generalized-log-likelihood-ratio-test-for-non-nested-models – Christian Hennig Jul 14 '22 at 10:48
  • Your non-linear function is the logistic curve. Are there any constraints on the response y? – dipetkov Jul 15 '22 at 07:26
  • @dipetkov. No: no more constraints than described – ecjb Jul 19 '22 at 06:37
  • Thank you all for your contributions to the question. I upvoted all your answers and accepted the one that seemed most complete. This is obviously not the most scientific way to choose, as I am a lay person in this matter; I really hope I didn't hurt anyone's feelings. Thank you very much again – ecjb Jul 19 '22 at 07:32

3 Answers


The formula

$$ AIC = n \log(\hat{\sigma}^2) + 2k $$

works because, up to an additive constant, it equals

$$ AIC = 2k - 2\ln({\hat {L}})$$

where $\hat L$ is the maximized likelihood. For models with a Gaussian likelihood, $\log \hat L = \log\Big(\prod_i \mathcal{N}(y_i \mid \hat y_i, \hat \sigma)\Big)$, or equivalently, $\epsilon_i \sim \mathcal{N}(0, \hat \sigma)$.

If both of your models are of this form, all that matters for calculating $AIC$ are a) the number of parameters $k$, and b) the mismatch $\epsilon$ between predicted values $\hat y$ and observed values $y$, summarised either as $n \log(\hat \sigma^2)$ or as $\log \hat L$, since $n \log(\hat\sigma^2)$ equals $-2\log\hat L$ up to an additive constant. The form of the function $F$ doesn't matter.

As for comparing $AIC$ values, you can forget about p-values, since $AIC$ and p-values are two different and incompatible approaches to hypothesis testing. Focus instead on the actual difference in AIC scores, as discussed, e.g. in this question or this blog. You might also look at AIC weights, which are a more intuitive way of interpreting AIC differences.
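For instance, here is a minimal sketch of computing $\Delta AIC$ and Akaike weights; the AIC values are placeholders to substitute with your own:

```python
import numpy as np

# Placeholder AIC values for the candidate models (substitute your own)
aics = np.array([-150.0, -155.0])

delta = aics - aics.min()       # Delta-AIC: 0 for the best (lowest-AIC) model
weights = np.exp(-0.5 * delta)
weights /= weights.sum()        # Akaike weights sum to 1
print(weights)                  # relative support for each model
```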

As @SextusEmpiricus points out, $\log\mathcal{L}(\beta ; y_i, x_i)$ and $- n \log(\hat{\sigma})$ differ by a constant value, so you should use one or the other equation for all your models rather than mixing and matching.

Eoin
  • Thank you for your answer @Eoin. So you mean that the AIC formula is the same for both a linear and a non-linear $F$? This is what I tried in a previous question, but as pointed out by dipetkov, it seems to be wrong for a non-linear function $F$: https://stats.stackexchange.com/questions/580103/model-selection-is-aic-enough-or-should-one-compute-the-p-value-in-model-select/580116#580116 – ecjb Jul 14 '22 at 11:30
  • @ecjb that might have been a confusion or simplification. It is more specifically the log likelihood for a Gaussian distribution that can be simplified to $$n\log\left(\frac{1}{n}\sum_{i=1}^n \left(\hat{y}(x_i)-y_i\right)^2\right)$$ and this has nothing to do with linear or non-linear models for $\hat{y}(x_i)$. (But often a linear model and a Gaussian distribution are used together, i.e. ordinary least squares regression/OLS. That's how they might get seen as the same.) – Sextus Empiricus Jul 14 '22 at 12:05
  • @SextusEmpiricus. Many thanks for your answer. Maybe I should rephrase my question on yet another level: I am not a statistician and want to use this definition of the AIC for the non-linear function in a manuscript in health science. Is this definition of the AIC still OK, or is it fundamentally wrong? – ecjb Jul 14 '22 at 12:27
  • I disagree with @SextusEmpiricus. To use the AIC you need to maximize a likelihood function. For the formula you cite the likelihood has to correspond to Gaussian errors. You can't just use any optimizer and then say "Oh, yes, and the residuals after I've done my optimization -- I'll assume these are Gaussian". – dipetkov Jul 14 '22 at 12:37
  • Where in the documentation does it say that scipy.optimize.curve_fit maximizes a likelihood function? – dipetkov Jul 14 '22 at 12:42
  • @dipetkov From the curve_fit docs: "Use non-linear least squares to fit a function, f, to data." Minimizing least squares is equivalent to maximizing a Gaussian likelihood. – Eoin Jul 14 '22 at 13:05
  • I don't think that's true for Non-linear least squares but then I'm not the one writing an answer, precisely because I'm not sure about the theory. – dipetkov Jul 14 '22 at 13:30
  • @dipetkov my point was that the likelihood function used here does not relate to linear/non-linear. The likelihood function relates to Gaussian distribution for the response. Whether the conditional mean is modelled as linear or non-linear does not matter. The comment that you made "you can't just use any optimiser", if it works with the non-linear case then it works with the linear case as well. – Sextus Empiricus Jul 14 '22 at 14:37
  • @SextusEmpiricus My read of the question is different: less about theory and more about doing a specific computation with scipy.optimize.curve_fit. To me at least the non-linear part is central to the question. If this optimization does find the MLE, then great; it's exactly what the OP is looking for. – dipetkov Jul 14 '22 at 15:19
  • @dipetkov I am not sure how your reading is different. I have summarised my view in a separate answer. I believe it shows very well that the log likelihood $-n\log\hat\sigma$ relates to a least squares model, and it is irrelevant whether the model is linear. – Sextus Empiricus Jul 14 '22 at 15:25

AIC formula

What would be the formula to compute the two AIC values (one for the linear and one for the non-linear model)?

TL;DR: Assuming that you do least squares regression for both the linear and the non-linear model, your formula for the AIC works for both.

Your formula is not for linear models but for models with Gaussian-distributed responses whose parameters are estimated by maximum likelihood.

Let the density for a measurement $y_i$ as a function of $x_i$ and parameters $\beta$ be:

$$f(y_i;x_i,\beta) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{1}{2} \left( \frac{y_i-g(x_i,\beta)}{\sigma} \right)^2}$$

where $g(x_i,\beta)$ can be a linear or a non-linear function. Then the likelihood function is

$$\mathcal{L}(\beta ; y_i, x_i) = \prod_{i=1}^n f(y_i ; x_i, \beta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{1}{2} \left( \frac{y_i-g(x_i,\beta)}{\sigma} \right)^2}$$

If you fill in the maximum likelihood estimates for $\beta$ and $\sigma$ (where $\hat\beta$ is also known as the least squares estimate):

$$\hat\beta = \arg\min_\beta \sum_{i=1}^n\left( y_i-g(x_i,\beta)\right)^2$$

$$\hat\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n\left( {y_i-g(x_i,\hat\beta)}\right)^2}$$

and then take the logarithm, you get

$$\begin{array}{lcl} \log\mathcal{L}(\hat\beta ; y_i, x_i) &=& \sum_{i=1}^{n} \log\left(\frac{1}{\sqrt{2\pi \hat\sigma^2}} e^{-\frac{1}{2} \left( \frac{y_i-g(x_i,\hat\beta)}{\hat\sigma} \right)^2}\right)\\ &=& -\frac{n}{2} \log(2\pi) - n \log(\hat{\sigma}) -\frac{1}{2} \frac{\sum_{i=1}^n\left(y_i-g(x_i,\hat\beta)\right)^2}{\hat\sigma^2}\\ &=& -\frac{n}{2} \log(2\pi) - n \log(\hat{\sigma}) -\frac{n}{2} \end{array}$$

And since the log likelihood can be shifted by any constant without changing model comparisons, you can use

$$\log\mathcal{L}(\hat\beta ; y_i, x_i) = - n \log(\hat{\sigma}) + \text{constant}$$

And your formula for the AIC is based on that: since $n \log(\hat{\sigma}^2) = 2n \log(\hat{\sigma})$, the formula $AIC = n \log(\hat{\sigma}^2) + 2k$ equals $2k - 2\log\mathcal{L}$ up to an additive constant.


So you can apply your AIC formula to both linear and non-linear relationships $g(x_i,\beta)$ describing the conditional mean in your model. The formula relates to the Gaussian distribution at the beginning of this post, not specifically to linear models. You can have linear models with a log likelihood different from $-n\log \hat\sigma$ and non-linear models with log likelihood $-n\log \hat\sigma$.
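A quick numerical check of this simplification, as a sketch with made-up residuals standing in for $y_i - g(x_i, \hat\beta)$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 100
residuals = rng.normal(0, 0.5, size=n)        # stand-in for y_i - g(x_i, beta_hat)
sigma_hat = np.sqrt(np.mean(residuals ** 2))  # maximum likelihood estimate of sigma

# Full Gaussian log likelihood evaluated at the ML estimates
loglik_full = norm.logpdf(residuals, loc=0, scale=sigma_hat).sum()

# Shortened form; the two differ by the constant -(n/2) * (log(2*pi) + 1)
loglik_short = -n * np.log(sigma_hat)
constant = -(n / 2) * (np.log(2 * np.pi) + 1)
assert np.isclose(loglik_full, loglik_short + constant)
```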

Sidenote: Be aware, however, when you mix AIC formulas based on different distributions. The constant shift that can make likelihood functions differ does not matter for relative comparisons (see Is the exact value of any likelihood meaningless?), but it becomes tricky when you make comparisons across different distributions.

p-values

and a corresponding p-value for a significant difference between the two AIC values

Comparing AIC is not typically about computing a p-value and testing a hypothesis. It is about making a selection among different models. See also: Why can't we use AIC and p-value variable selection within the same model building exercise?

If you still would like to test the hypothesis that both models have the same AIC, you can approach it as a likelihood ratio test, whose test statistic approaches a chi-squared distribution when the models are nested. If the models are not nested, I do not know what distribution the statistic follows, but there might be ways to handle such cases as well.
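For the nested case, a minimal sketch of such a likelihood ratio test; the log likelihoods and parameter counts are placeholders (and note that the two models in the question are not nested):

```python
from scipy.stats import chi2

# Placeholder maximized log likelihoods; model 1 is nested in model 2
loglik_restricted, k_restricted = -60.0, 2
loglik_full, k_full = -55.0, 4

lr_stat = 2 * (loglik_full - loglik_restricted)       # likelihood ratio statistic
p_value = chi2.sf(lr_stat, df=k_full - k_restricted)  # upper tail probability
print(lr_stat, p_value)
```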

  • An important sidenote. AIC does not tell much about the absolute quality of the model. It is only about the relative quality. – Sextus Empiricus Jul 15 '22 at 07:49
  • Oh, that's a good point, I shouldn't have said that $\log\mathcal{L}(\beta ; y_i, x_i)$ and $- n \log(\hat{\sigma})$ are equivalent; they just differ by a constant, right? – Eoin Jul 15 '22 at 09:19
  • "If the models are not nested then I do not know what sort of distribution they follow but possibly there are might be cases for this as well that can be handled." I believe the first answer here with literature is pretty good: https://stats.stackexchange.com/questions/110038/generalized-log-likelihood-ratio-test-for-non-nested-models – Christian Hennig Jul 16 '22 at 12:15

This is an extended comment, not an answer.

You want to compute the AIC for a linear and a nonlinear model in order to compare the two models in terms of their predictive performance.

The Akaike information criterion (AIC) estimates how well a model generalizes to new data when the model is well-specified. This assumption is required to derive the number-of-parameters penalty from the Fisher information matrix [1]. In other words, the AIC is not model-robust [2]. See Figure 7.4 in [3] for a (logistic regression) example that shows how the AIC agrees well with test-sample error when the model is well-specified but not when it is mis-specified (in that case, over-parameterized).

So, not surprisingly, both the linear and the non-linear model should be a reasonable fit to the data in order to make a meaningful comparison. Let's take a look at the models you want to compare. Your linear model is a line without an intercept; it's equivalent to a no-intercept linear regression. Your nonlinear model is the logistic function; it's not equivalent to a logistic regression because, to compute the AIC, you implicitly assume that the errors are iid Normal with mean 0 and variance $\sigma^2$.

Aside about leaving out the intercept: this is usually not a good idea, but if you stare at your two models carefully (or even better, plot them), you'll notice that leaving out the intercept might be a particularly unfortunate choice in this case. The linear model is a line with slope $a_1$ which passes through the origin $(0,0)$. The logistic curve doesn't pass through the origin but through $(0, a_2/2)$; in fact it approaches $0$ as $x \to -\infty$ and $a_2$ as $x \to +\infty$.

The normality assumption may be tricky to justify if the range of $y$ is bounded. Even then you have to show that both a linear and a logistic model are well-specified. The logistic curve is approximately linear only around $x = 0$. Over a wide range of $x$ it may be hard to argue that both a straight line and a bounded curve are well-specified models for the same observations.

I illustrate this by generating data from a logistic model, adding noise and then plotting the best fitting line (in red) and logistic curve (in blue). For fun, I also add the no-intercept line (in gray) and a restricted cubic spline with 4 knots (in green). You could consider making a similar plot for your real data.
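A minimal Python sketch of a similar simulation (the original figure and table were produced separately; the restricted cubic spline fit is omitted here for brevity):

```python
import numpy as np
from scipy.optimize import curve_fit

def aic(y, y_hat, k):
    n = len(y)
    return n * np.log(np.sum((y - y_hat) ** 2) / n) + 2 * k

# Simulate noisy observations from a logistic curve
rng = np.random.default_rng(42)
x = np.linspace(-4, 4, 60)
y = 2 / (np.exp(-1.5 * x) + 1) + rng.normal(0, 0.15, size=x.size)

# Candidate models: (function, number of fitted parameters)
models = {
    "no-intercept line": (lambda x, a1: a1 * x, 1),
    "straight line": (lambda x, a0, a1: a0 + a1 * x, 2),
    "logistic curve": (lambda x, a2, a3: a2 / (np.exp(-a3 * x) + 1), 2),
}
for name, (f, k) in models.items():
    popt, _ = curve_fit(f, x, y)
    print(f"{name}: AIC = {aic(y, f(x, *popt), k):.1f}")
```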

[1] C. R. Shalizi. Advanced Data Analysis from an Elementary Point of View (2021). Available online.
[2] G. Claeskens and N. L. Hjort. Model Selection and Model Averaging. Cambridge University Press (2008)
[3] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2008)

[Figure: two panels of simulated data, each showing the fitted no-intercept line (gray), straight line (red), logistic curve (blue), and restricted cubic spline (green).]

model             parameters   AIC (left)   AIC (right)
straight line     3            -146.6279    -182.7456
logistic curve    3            -181.9461    -182.2991
cubic spline      5            -181.0472    -182.3560

The restricted cubic spline, which allows for nonlinearity in a flexible way, has the same predictive performance (in terms of the AIC) as the logistic curve, even though the data is simulated from the logistic model and the spline has two extra parameters. The line without an intercept doesn't fit at all in either case, while the line with an intercept is the best fit for the data in the right panel.

dipetkov