
I am a data science intern and I have been tasked with testing the time scalability of the schedule builder. Basically, I have collected data and made a bunch of fit lines using the lmfit module. Now I need to run the schedule builder and see how well my fit lines predict the results. Unfortunately, in school I never had to take it further than making the fit lines, so I am a bit unclear how to proceed. I have been reading about confidence intervals, but my statistics is a little shaky, and I am not looking for theoretical accuracy but am testing my model against real results. All of my fit lines look like this:

output = (slope ± slope_error)*input + (intercept ± intercept_error)

So really what I need help with is finding the range of acceptable outputs based on a specific input. Can I use normal error propagation? For instance: output = slope*input + intercept ± sqrt((input*slope_error)^2 + (intercept_error)^2). And if my result falls within this range, then my model predicted correctly.
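In code, I picture something like this hypothetical helper (assuming the slope and intercept errors are independent, which may not actually hold):

import numpy as np

def propagated_range(x, slope, slope_err, intercept, intercept_err, nsigma=1):
    # standard uncorrelated error propagation for y = slope*x + intercept;
    # the slope error has to be scaled by x before adding in quadrature
    y = slope*x + intercept
    y_err = np.sqrt((x*slope_err)**2 + intercept_err**2)
    return y - nsigma*y_err, y + nsigma*y_err

Then I would check whether the measured output lands inside the returned range.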

My other thought is that we can use the errors to get the acceptable range. For example, the maximum possible value is: output = (slope + slope_error)*input + (intercept + intercept_error)

And the minimum value is output = (slope - slope_error)*input + (intercept - intercept_error)

So if my result falls between these two values, then the model is good.
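As a hypothetical helper, that second idea would look something like this (assuming input >= 0, so that the plus signs really do give the maximum):

def minmax_range(x, slope, slope_err, intercept, intercept_err):
    # push both parameters to their error limits
    y_max = (slope + slope_err)*x + (intercept + intercept_err)
    y_min = (slope - slope_err)*x + (intercept - intercept_err)
    return y_min, y_max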

Or perhaps both of these are incorrect; I am a bit lost right now.

  • Try to predict "output" using the given values of "input" and your line parameters, then compute the error as the difference between your predicted output and the "real" output. From those errors, your measures of goodness are the mean (hopefully close to 0) and the variance; a minimal sketch of this follows the comments. –  Dec 07 '21 at 15:25
  • A fitting line is basically two parameters: (m, n), sometimes called (x1, x0). To evaluate a new point x, just do ypred = x*m + n and you will get the predicted value ypred, which you can compare with the real value yreal. The distance metric you use depends on the problem: L1, L2, Mahalanobis... – Elerium115 Dec 07 '21 at 15:29
  • So basically you cannot use the errors in the parameters to test the model? I will get the differences between the real and theoretical output and look at those. Thanks! –  Dec 07 '21 at 15:37
  • ... and don't forget to plot your data: a scatter plot of output vs. input plus your fitted line. Sometimes a line is not the most appropriate curve to fit. –  Dec 07 '21 at 16:28
  • appreciate it :) –  Dec 07 '21 at 17:43
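Following the first comment, a minimal version of that residual check might look like this sketch (the predicted and measured runtimes here are made up for illustration):

import numpy as np

# made-up values: model predictions vs. measured runtimes
predicted = np.array([1.02, 2.11, 3.05, 4.20])
measured  = np.array([1.00, 2.15, 3.00, 4.30])

residuals = predicted - measured
print("mean residual:    ", residuals.mean())  # hopefully close to 0
print("residual variance:", residuals.var())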

1 Answer


If you are using the lmfit Python package, and specifically the lmfit.Model interface to data fitting, then calculating "the range of acceptable outputs" is pretty straightforward with the eval_uncertainty method. You can use that with (for example) matplotlib's fill_between method to visualize the range.

A simple example might look like this:

import numpy as np
import matplotlib.pyplot as plt
from lmfit.models import LinearModel

# make up something nearly linear and slightly noisy:
x = np.arange(51)
y = 4 + 0.07*x + 0.0003*x*(x-30) + np.random.normal(size=len(x), scale=0.1)

mymodel = LinearModel()

# create slope and intercept parameters, with initial values (see Notes below)
params = mymodel.make_params(intercept=y.min(),
                             slope=(y.max()-y.min())/(x.max()-x.min()))

result = mymodel.fit(y, params, x=x)
print(result.fit_report())

dely = result.eval_uncertainty(sigma=3)

plt.plot(x, y, 'o', label='data')
plt.plot(x, result.best_fit, label='fit')
plt.fill_between(x, result.best_fit-dely, result.best_fit+dely,
                 color="#ABABAB", label=r'3-$\sigma$ uncertainty band')
plt.legend()
plt.show()

This will print out a report like:

[[Model]]
    Model(linear)
[[Fit Statistics]]
    # fitting method   = leastsq
    # function evals   = 7
    # data points      = 51
    # variables        = 2
    chi-square         = 0.76282188
    reduced chi-square = 0.01556779
    Akaike info crit   = -210.330374
    Bayesian info crit = -206.466723
[[Variables]]
    slope:      0.07622481 +/- 0.00104910 (1.38%) (init = 0.07931648)
    intercept:  3.86210965 +/- 0.03043598 (0.79%) (init = 3.907146)
[[Correlations]] (unreported correlations are < 0.100)
    C(slope, intercept) = -0.862

and produce a plot like this:

[plot: data points, linear best fit, and shaded 3-sigma uncertainty band]

which shows the (correlation-included) 3-sigma confidence band for the best fit.
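For reference, the same band can be computed by hand from the parameter covariance matrix, which is what makes it "correlation-included" (a sketch; result.covar holds the covariance matrix, with row/column order given by result.var_names):

import numpy as np

# manual 3-sigma band for y = slope*x + intercept, including the
# slope-intercept covariance term (order assumed: slope, intercept)
var_s, var_i = result.covar[0, 0], result.covar[1, 1]
cov_si = result.covar[0, 1]
dely_manual = 3 * np.sqrt(var_s*x**2 + var_i + 2*x*cov_si)

Note the negative correlation reported above (-0.862) shrinks the band compared to treating the errors as independent.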

Now, it should be said that for a linear model (and here, "linear" means linear in the parameters, so any polynomial, or indeed any model that can be linearized), you do not need iterative methods like those used by lmfit: you could do this all analytically with regression methods. lmfit simply allows you to extend the same approach easily to non-linear models.
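As a sketch of that analytic route (using the x and y from the example above), NumPy's polyfit can return the covariance matrix directly:

import numpy as np

# closed-form linear regression; cov=True also returns the covariance matrix
coefs, cov = np.polyfit(x, y, 1, cov=True)
slope, intercept = coefs
slope_err, intercept_err = np.sqrt(np.diag(cov))
print(f"slope     = {slope:.5f} +/- {slope_err:.5f}")
print(f"intercept = {intercept:.5f} +/- {intercept_err:.5f}")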

Linear regression does not require starting values, but non-linear solvers do, so lmfit needs them. For truly linear data these guesses do not need to be very good, so the crude values used here are sufficient. If you are embedding this in a real or more complex analysis process, you might want to choose them more carefully.
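If you would rather not pick starting values by hand, lmfit's built-in models can estimate them from the data with the guess method:

# let LinearModel estimate starting values from the data itself
params = mymodel.guess(y, x=x)
result = mymodel.fit(y, params, x=x)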