Figuring out a good fit for this data

Question

I am trying to find an appropriate mathematical model/equation for this data. Physically, it is essentially linear correlations of rainfall error (y-axis) with distance (x-axis). So for very short distances, the errors are highly correlated, but this correlation drops off sharply. For large distances, the correlations should asymptote to values close to zero. Here are some example scatter plots for each month of the year.

Jan - y-intercept greater than 1. Too sharp of a decay.

Feb - correlations are too low for small distances.

Mar - perhaps the best-looking fit.

Apr - the fit fails this month. Capturing the drop-off to negative values and the return towards zero would be realistic and a bonus, but I am unable to do that with a simple exponential equation.

May - not bad.

Jun - y-intercept greater than 1.

Jul - perhaps too sharp of a decay.

Aug - not bad.

I have tried an exponential model, which is decent for some months but poor for others. In the months where it does poorly, it does not capture the sharp drop-off and, critically, the y-intercept is greater than 1, which is not physically possible. Are there any equations you can suggest that keep the y-intercept less than 1 and look like they could be good for this data? I am using lmfit in Python to fit a model to the data (that is the blue line shown). If you have experience with this module, that would be great!

Thank you.

I would be wary of both taking and giving advice on data modeling from just seeing a couple of pictures. — MathIsLife12, Jul 08 '22 at 05:22
Can you post your data? Also, do you have timestamps, or only information about the month of each data point? If the latter, do all (say) "March" observations at least come from the same year, or could they be from different years? If so, do you have at least this information? — Stephan Kolassa, Jul 16 '22 at 10:00
Hi Stephan, it is the same month from different years. For example, the March data is the correlations from different pairs of stations (all valid pairs from a set of stations) from Marches from 2001 to 2021. — AzureWinds, Jul 22 '22 at 08:05

kqr · Answer 1 · 2022-07-16T10:51:30.120

TL;DR: Step one is figuring out for yourself, "what do I mean by a good fit?"

Once you have done that, you might discover that the rest of the answers come much easier to you. Before you have done that, nobody can help you, least of all us.

Some general advice that should be helpful: you need to specify what the cost of misprediction is in both directions (e.g. is underestimation worse than overestimation, and by how much?)

This is tied to how you are going to use the data in practice.

Once you have that, you can start with a baseline model that just predicts a constant value (e.g. average, or whatever minimises cost) so you have something to compare your other models to.
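As a rough illustration of the two points above, here is a minimal sketch of a constant baseline chosen to minimise an asymmetric cost. The data values, the 2:1 weighting, and the function names are all made up for the example; substitute whatever cost reflects your actual use case.

```python
# Hypothetical correlation values; replace with your own observations.
observed = [0.9, 0.7, 0.4, 0.2, 0.1, 0.05, -0.05]


def cost(prediction, actual, under_weight=2.0, over_weight=1.0):
    """Asymmetric absolute error.

    Assumption for this sketch: underestimating is twice as costly
    as overestimating. Pick weights that match your own situation.
    """
    error = actual - prediction
    if error > 0:  # prediction was too low
        return under_weight * error
    return -over_weight * error  # prediction was too high (or exact)


def total_cost(prediction, data):
    """Sum of the per-point costs for one constant prediction."""
    return sum(cost(prediction, y) for y in data)


def best_constant(data):
    """Constant prediction with the lowest total cost.

    For a piecewise-linear cost like the one above, a minimum is always
    attained at one of the observed values, so it is enough to try each.
    """
    return min(data, key=lambda c: total_cost(c, data))


baseline = best_constant(observed)
mean_prediction = sum(observed) / len(observed)

print("cost-minimising constant:", baseline)
print("its total cost:", total_cost(baseline, observed))
print("total cost of predicting the mean:", total_cost(mean_prediction, observed))
```

Any fitted curve can then be scored with the same `total_cost` idea (using its prediction at each point instead of a constant) and compared against this baseline.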

Another piece of general advice: simpler is usually better. This is for two reasons: it's easier to explain, and it's less likely to fit noise. Maybe a line is all you need. But it depends on the use case and the penalty for errors.

It sounds like you're approaching this backwards. You're looking for a magic formula that hugs closely to a set of points, and then you'll figure out how to use it. It would be better to start from the actual use case and see how advanced a model you really need. It's usually simpler than you think.

Another thought: do you need to fit a theoretical model at all? The alternative is drawing from the empirical CDF formed by the data you have observed. There are usually only three reasons to fit a theoretical model:

To extrapolate to values you have not seen.
To help understand the dynamics of the problem better.
Because convention requires it.

Do any of these apply to your situation? If not, you can probably get by much more easily.
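For what it's worth, drawing from the empirical distribution needs very little machinery. The following is a minimal sketch using only the Python standard library; the (distance, correlation) pairs and the helper names `draw` and `ecdf` are invented for illustration.

```python
import bisect
import random

# Hypothetical (distance_km, correlation) pairs; replace with your own data.
pairs = [
    (5.0, 0.92),
    (12.0, 0.75),
    (30.0, 0.41),
    (55.0, 0.18),
    (80.0, 0.05),
    (120.0, -0.03),
]

rng = random.Random(0)  # fixed seed so the draws are repeatable


def draw(data, rng):
    """One draw from the empirical distribution: a uniformly random observed pair."""
    return rng.choice(data)


# Sorted correlation values, used to evaluate the empirical CDF.
sorted_corr = sorted(corr for _, corr in pairs)


def ecdf(value, sorted_values=sorted_corr):
    """Empirical CDF: the fraction of observed values less than or equal to `value`."""
    return bisect.bisect_right(sorted_values, value) / len(sorted_values)


sample = [draw(pairs, rng) for _ in range(1000)]
print("first three draws:", sample[:3])
print("fraction of correlations <= 0.2:", ecdf(0.2))
```

Note that this only ever reproduces values you have already observed, which is exactly the limitation mentioned above: it does not extrapolate to distances you have no data for.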

Hello kqr, thank you for your reply. I believe the alternative of drawing from the empirical CDF is a better idea. Do you have any resources on how to do this in Python? — AzureWinds, Jul 22 '22 at 08:02
Is it sufficient if I tell you that you could store each pair of (rainfall error, distance) in a list and then pick a random element from that list any time you want a new draw from the empirical distribution? @AzureWinds — kqr, Jul 22 '22 at 18:06
Hello kqr, I understand how to put it in a list and draw a random element, but I am confused how to make an empirical distribution from that. Do you have a link that has some resources on that concept? — AzureWinds, Jul 26 '22 at 00:42
The draws from that list are draws from the empirical distribution. If you need an actual c.d.f. you can just sum up the values, like you would with any p.d.f. — kqr, Jul 26 '22 at 06:27
Sorry for the big break in reply. So returning to this question, if I had to find what a value from a hypothetical fit at distance = 0 would be, how would I do this from the empirical distribution? — AzureWinds, Aug 10 '22 at 00:12
