Fitting data taking into account for the spread in data, which are zero for some data points

Question

I'm trying to use scipy to fit a $\tanh$ function to some data. The data is of the form $(x_i, y_i)$ for $i=1,\cdots,N$, where $0\leq y_i \leq 1$. I choose $x_i$ to be linearly spaced, such that $x_0=0$ and $x_N=1$. $y_i$ are further obtained from repeating an experiment $M$ times, and checking an event happened or not ($0$ -> did not happen, $1$ -> did happen). The $y$ values are thus a point estimate of probability $p$ for a binomial variable: $y_i=\frac{1}{N}\sum\limits_{j=1}^M X_j$. The standard deviation of $y_i$ is thus $s_i=\sqrt{y_i(1-y_i)/M}$. This is how I end up with data that is spread like $0.5\tanh(a(x-b))+0.5$, which is what I'm trying to fit.

There is more spread in the data if the values are around $0.5$, but many data points at $0$ and $1$ have a standard deviation of $s_i=0$ (such that $X_j=0$ or $X_j=1$ for all measurement repeats $j$).

I tried the method described in this answer. For that particular code, if I set one of the standard deviations to be 0, i.e.:

y_spread[3] = 0

then I get a runtime error:

RuntimeWarning: divide by zero encountered in divide

This makes sense to me, as $s_i=0$, and you can't divide by $0$. Now, the question is, what is the correct way to handle this, statistically? A quick and dirty way could be to set the error to be something small, like 1e-6, when it is 0. This does result in a fit, but am I angering the statistics gods?

EDIT: added more information as requested in the comments.

I can't speak for the others, but I was not personally angered by this. — John Madden, Mar 30 '23 at 18:18
(more seriously: the problem is that this functions uses the Standard dev to figure out which datapoints it should prioritize fitting. By using a standard dev of 0, you are asking for infinite sensitivity, and the numerical procedure cannot proceed. You should be OK using a small value like 1e-6. To be sure, use a logarithmic range of values, and see how your analysis changes. ) — John Madden, Mar 30 '23 at 18:20
The nature of your data is unclear: if they "go from 0 to 1," how are you fitting anything? Surely you need data pairs, right? And in the pairs, which of the components is measured? One might guess you have $(x_i,y_i)$ pairs with $0\le y_i\le 1$ being your measurements. If so, how are the $x_i$ determined, what kind of measurement is being used to find the $y_i$ (its statistical properties loom important), and what fitting method are you using? — whuber, Mar 30 '23 at 18:29
@whuber yes, that is the nature of the data. I repeat an experiment at each $x_i$ 20 times, which gives me a bunch of 1s and 0s (because I'm checking whether an event happened or not), the average of which is $0\leq y_i \leq 1$. Standard deviation on $y_i$ are thus $s_i=\sqrt{y_i(1-y_i)/20}$. So I choose $x_i$ to be linearly spaced. I'm using the default in scipy.optimize.curve_fit, so Levenberg-Marquardt. — sodiumnitrate, Mar 30 '23 at 19:58
This kind of problem is much better suited for (Binomial) logistic regression, not curve fitting. You can fit specific curves using maximum likelihood. As an example (with a different function than tanh, but otherwise an identical situation) see my post at https://stats.stackexchange.com/a/64039/919. You can also code the log likelihood function and directly optimize it. — whuber, Mar 30 '23 at 20:23

Fitting data taking into account for the spread in data, which are zero for some data points

0 Answers0