Better function to fit log-like data?

Question

I'm trying to fit a two-parameter function to data that looks like this (black dots, x scale is logarithmic):

The best fit I could find is an $arctan$, measured by the MSE. All the seven functions I tested and the Python code are shown below.

Is there perhaps a better function to fit this data that I might have overlooked?


import numpy as np
import matplotlib.pyplot as plt
def main():
    x = np.array([0.033, 0.062, 0.104, 0.210, 0.424, 0.861, 1.040, 1.133, 1.935, 3.803, 6.289, 11.519, 29.118])
    y = np.array([
        0.077, 0.149, 0.187, 0.229, 0.299, 0.419, 0.469, 0.499, 0.679, 0.806,
        0.888, 0.928, 0.956])
N = 100
for i, func in enumerate((f1, f2, f3, f4, f5, f6, f7)):
    beta_opt, alpha_opt, delta_old = np.nan, np.nan, np.inf
    for alpha in np.linspace(0., 2, N):
        for beta in np.linspace(0., 5, N):
            y_fit = func(x, alpha, beta)
            delta = np.square(y - y_fit).sum()
            if delta &lt; delta_old:
                beta_opt, alpha_opt = beta, alpha
                delta_old = delta
    x0 = np.linspace(0.01, 30, 100)
    y0 = func(x0, alpha_opt, beta_opt)
    plt.plot(
        x0, y0, ls=':', lw=2,
        label=&quot;f{}, alpha={:.3f}, beta={:.3f}, MSE={:.3f}&quot;.format(i+1, alpha_opt, beta_opt, delta_old))

plt.scatter(x, y, c='k')
plt.xscale('log')
plt.legend()
plt.show()


def f1(x, alpha, beta):
    return alpha + beta*np.log(x)
def f2(x, alpha, beta):
    return alphanp.tanh(xbeta)
def f3(x, alpha, beta):
    return alpha + beta*np.tanh(x)
def f4(x, alpha, beta):
    return 1/(alpha+beta*np.exp(-x))
def f5(x, alpha, beta):
    return alpha + 1/(1+beta*np.exp(-x))
def f6(x, alpha, beta):
    return alpha - np.exp(-beta*x)
def f7(x, alpha, beta):
    return alpha + beta*np.arctan(x)
if name == 'main':
    main()

Are you restricted to just two parameters? The curve looks more complex than that. There is nothing wrong with using a logarithmic x-axis scale, it is done all the time fro dose-response studies. See this guide for the basics: https://www.graphpad.com/guides/prism/latest/curve-fitting/reg_principles_of_curve_fitting.htm — Michael Lew, Mar 26 '24 at 19:47
I restrict the fit to two-parameters to avoid having too many parameters to fit later on (this is part of a much larger model) — Gabriel, Mar 26 '24 at 19:50
Looks a bit like a (sigmoid) Emax function such as $E_0 + E_\max x^h / (x^h + EX_{50}^h)$, where maybe $E_0$ is just 0 and $E_\max$ might be 1, so perhaps just $x^h / (x^h + EX_{50}^h)$ (or perhaps even with $h=1$). This function is often used for dose-response models (e.g. drug dose vs. efficacy) or concentration-response. — Björn, Mar 27 '24 at 10:31

Nick Cox · Accepted Answer · 2024-03-28T08:53:34.490

As you helpfully provide the data, speculation about what might work, well or at all, can be tempered by experiment. This isn't a complete answer, as I suspect that you need something more complicated than you want to do anything close to tracking the data. But it won't fit in a comment.

I just plotted logit $y$ against log (meaning ln, but any other base would serve) $x$ and fitted a straight line by plain regression. The good news is $R^2$ of $0.9737$ which in many fields would be cause for champagne, or the local equivalent. The bad news is that it is not catching much of the structure of the data.

The axes are labelled in terms of the original values.

I agree broadly with the idea of a sigmoid curve, as although oddly this isn't stated in the question, the outcome seems to fall within the interval $(0, 1)$. But it's not obvious to me that you need iterative fitting to do much better.

Note that atanh() and logit() are just sisters under the skin.

EDIT: It's not clear why arctangent is of interest as it's not sigmoid over the range of the data. Here is the equivalent plot. It performs worse than the logit, in my view. I cite $R^2$ of $0.9602$ for what's it worth, which is very little.

NOTE ON USE OF $R^2$: By citing $R^2$ and being dismissive of it I am playing a double game. First off, some readers may wish to know its value. But, but, but:

When data show monotonic change, almost any fit, even a lousy one, will show very high $R^2$. That is most a very little deal. The real question is whether a fit captures the pattern of change and leaves just noise in the residuals.
The first $R^2$ was for the space shown on the plot, namely (log $x$, logit $y$), and the second was for a different space, namely (log $x$, atan $y$). These are different spaces and $R^2$ is not rendered exactly comparable by being scaled to $[0, 1]$ in both. But note that e.g. RMSE if calculated in these spaces would be even less comparable.
I didn't use nonlinear least squares for which any $R^2$ would presumably refer to the space of the original data $(x, y)$. I am just seeing how far you can get with transformations and plain regression.

Thank you Nick. I'm trying to reproduce this analysis in Python, how did you generate the data to produce these plots? If I transform my y data to logit and my x to log, the result is $y=x/(x+1)$ which is a nice fit, but still not as good as the $arctan$ (measured by the MSE at least) — Gabriel, Mar 27 '24 at 12:42
I am fitting (Y := logit y) = a + b (X : = log x) and $R^2$ is measured relative to Y and X, not the original y and x. On the transformed scale logit clearly gets closer to the shape of the data. I wouldn't want to assess fit on the original scales, as that doesn't discrmininate well at the extremes. — Nick Cox, Mar 27 '24 at 14:06
From a purist or even a practical point of view $R^2$ has many limitations and at most should only be compared within the same data or transformed space. The pattern and relative size of residuals are a bigger deal. — Nick Cox, Mar 27 '24 at 17:14

score 2 · Answer 2 · answered Mar 26 '24 at 22:35

You can use a sigmoid to roughly fit the data with the bottom and top parameters fixed (they seem fixed already to 1 and 0 in the data) and then the two parameters in the fit can be location (EC50?) and slope. There are many different equations that give sigmoid curves, but you could probably do well to use a standard function for log dose-response curves:

$y=baseline+\frac{range}{1+e^{-slope*(x+pEC50)}}$

Substitute 1 for the range parameter, and 0 for the baseline and to get a two parameter curve will fit roughly your shape. Note that the x value is log, so you need to input it as the logarithm, and the location parameter, pEC50 is the negative of the log x value needed for a response of 0.5 (with the baseline and range fixed as described).

I do not know if the logistic will be a superior model for you to fit to your data, but the function is well-behaved and should converge to a solution easily with a starting value (for your data) of about -1 for pEC50 and 2 for slope (the latter is a guess).

Better function to fit log-like data?

2 Answers2

Linked