How is the distribution of the data of importance when comparing linear regression results?

Question

I have come across a statistical modelling approach where the changes in nesting date over time are compared between 9 distinct populations of one single species. The method is described like this:

we used simple linear regression of nest initiation date as a function of year to determine whether there were overall unidirectional trends in nesting date over time.

Looking at the datasets, I have noticed that the distribution of observations per day has a completely different shape from one population to another, e.g.:

[Histogram, population 1: n=230 observations]

versus

[Histogram, population 2: n=200 observations]

The histograms show the number of observations (y-axis) per day (x-axis) over the whole dataset period.

Do these structural differences have any impact on comparing the results between the two populations when using a linear regression approach? Why or why not?

I will also add this small comment, even though it is not maths: to me, it is strange that the same biological event looks so different in structure between two populations of the same species.

Update:

For both populations, the linear model was fitted as follows:

Population 1 (n=230)

>         summary(lm(nestDay~Year, data=.))
Call:
lm(formula = nestDay ~ Year, data = .)
Residuals:
    Min      1Q  Median      3Q     Max 
-117.54  -81.01  -11.20   81.94  146.98
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 5813.447   4822.082   1.206    0.229
Year          -2.788      2.396  -1.164    0.246
Residual standard error: 83.6 on 228 degrees of freedom
Multiple R-squared:  0.005905,  Adjusted R-squared:  0.001545 
F-statistic: 1.354 on 1 and 228 DF,  p-value: 0.2458

Then:

library(lattice)  # provides xyplot(), panel.loess() and panel.xyplot()
resid=residuals(lm(nestDay~Year, data=.))
xyplot(resid~data$Year, panel=function(x,y){panel.loess(x,y,span=0.5,col=1); panel.xyplot(x,y)})

Finally:

> AIC(lm(nestDay~Year, data=.))
[1] 2692.674
> AIC(lm(nestDay~1, data=.))
[1] 2692.036

versus

Population 2 (n=200)

>  summary(lm(nestDay~Year, data=.))
Call:
lm(formula = nestDay ~ Year, data =.)
Residuals:
    Min      1Q  Median      3Q     Max 
-22.730  -6.730  -1.280   5.507  43.619
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -433.4591   571.0403  -0.759    0.449
Year           0.2751     0.2839   0.969    0.334
Residual standard error: 10.19 on 198 degrees of freedom
Multiple R-squared:  0.004719,  Adjusted R-squared:  -0.0003073 
F-statistic: 0.9389 on 1 and 198 DF,  p-value: 0.3338

Then:

FJresid=residuals(lm(nestDay~Year, data=.))
xyplot(FJresid~data$Year, panel=function(x,y){panel.loess(x,y,span=0.5,col=1); panel.xyplot(x,y)})

Finally:

> AIC(lm(nestDay~Year, data=.))
[1] 1499.997
> AIC(lm(nestDay~1, data=.))
[1] 1498.943
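
For readers who want to reproduce this kind of comparison, here is a minimal sketch on simulated data. The data frame `nests` and its values are hypothetical and are not the data from the study; only the two pairs of AIC values at the end are taken from the output above. For both populations the intercept-only model has the marginally lower AIC, with differences of about 0.64 and 1.05.

```r
# Hypothetical simulated data: no real trend in nesting day across years.
set.seed(1)
nests <- data.frame(Year = rep(2000:2019, each = 10))
nests$nestDay <- 150 + rnorm(nrow(nests), sd = 10)

fit_trend <- lm(nestDay ~ Year, data = nests)  # linear trend in year
fit_null  <- lm(nestDay ~ 1,    data = nests)  # intercept only (no trend)

# A positive difference means the intercept-only model has the lower AIC.
delta_aic <- AIC(fit_trend) - AIC(fit_null)
print(delta_aic)

# The same difference computed from the AIC values reported above.
delta_pop1 <- 2692.674 - 2692.036  # population 1
delta_pop2 <- 1499.997 - 1498.943  # population 2
print(c(pop1 = delta_pop1, pop2 = delta_pop2))
```

Because the two models are nested and differ by one parameter, the difference `AIC(fit_trend) - AIC(fit_null)` can never exceed 2, so small positive values such as these only say that the year term does not improve the fit enough to justify the extra parameter.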

The histograms are of nest initiation days and one observation refers to one individual in one year? If not, what else? — Christian Hennig, Jun 18 '22 at 09:24
"And I will add this small comment even if it is not maths; to me, it is also strange that the same biological event looks like that different in structure between two populations of the same species." I'd agree with this, and for sure one would want to have an explanation for this (which can only be given by those who collected the data.) — Christian Hennig, Jun 18 '22 at 09:26
Yes you are right I will update to give a title to the histograms, sorry for the oversight! — Recology, Jun 18 '22 at 09:28
I'd suspect that there may be serious problems with the assumptions of linear regression, and this may concern the distribution of residuals (which is related to the distribution of the $y$ variable, although one would need to know the $x$-variable to say how exactly), but also other issues such as problems with independence. From the shown data one can suspect such issues, but one would need to look at the relation with $x$ to know for sure. — Christian Hennig, Jun 18 '22 at 09:29
Thanks for your help and answer. I have access to the original script, so I will run what the author did for the lm modelling on both examples and update the post. — Recology, Jun 18 '22 at 09:42
Data for population 1 don't seem very worrying to me. Although regression assumes the residuals to be distributed continuously, which they clearly are not, this kind of distribution doesn't tend to create problems with the regression. I'd in principle be more worried about linearity, given that the number of days in a year is limited, and a linear function will grow ever further for years in the future, however here there is no evidence of any growth (all insignificant), so the model looks more useless than wrong... (if we ignore potential dependence.) — Christian Hennig, Jun 18 '22 at 09:58
The regression for population 2 is also insignificant throughout, with no evidence for any linear trend. One may suspect slight nonlinearity here. Generally I don't think the linear models give any information of interest; it'd really be more interesting to explore the difference in distributional patterns (very discrete vs. quite continuous) between the two populations, but as written before, one would need to look into data collection and background knowledge. — Christian Hennig, Jun 18 '22 at 10:00
OK, thank you very much again. Could you please clarify what you mean by "the model looks more useless than wrong"? If I understand you correctly, the lm in this case does not allow one to draw any conclusions about the trend over time? — Recology, Jun 18 '22 at 10:03
Reading the method description again, I'd probably qualify what I had written earlier. The authors wanted to see whether there's evidence for a trend over time, and although one could criticise them for trying a linear model where whatever would go on would rather be nonlinear, still interpreting results saying that there is no such evidence looks OKish to me, so one could even say that there is some kind of use, be it to make a negative statement. — Christian Hennig, Jun 18 '22 at 10:06
But if on the contrary they use this modelling to draw conclusions like: "Nesting date advanced in 4 populations by a rate of 0.11 to 0.51 days per year, while nesting dates were delayed in 4 populations by a rate of 0.09–0.30 days per year", then it feels not right, am I correct? — Recology, Jun 18 '22 at 10:09
I only see two results and you're talking about 8 populations, so I don't know what goes on in the other six. The two shown results provide no evidence for any trend. — Christian Hennig, Jun 18 '22 at 10:10
It was just an example of the type of conclusions they write. The 2 populations are included in these 8, so I think I get your point. Last question: would you say that another regression approach would be better suited, e.g. quantile regression (if that can be answered just by looking at this data)? :) — Recology, Jun 18 '22 at 10:12
For those two populations it looks like there is so little going on that I don't expect further insight from anything more sophisticated (at least without having more data that could shed a light on potential dependence between observations). Can't comment on the other six. — Christian Hennig, Jun 18 '22 at 10:15
Ok thank you very much again for your time and precious help! — Recology, Jun 18 '22 at 10:17
One could code year as categorical for the second population and run a one-way analysis of variance to say something about differences between specific years, but this won't translate into any overall trend. — Christian Hennig, Jun 18 '22 at 10:18
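
The suggestion in the last comment, treating year as a categorical variable and running a one-way analysis of variance, could look roughly like the sketch below. The data frame `nests`, the column `YearF` and the simulated values are hypothetical stand-ins, not the study data, and Tukey's HSD is just one common choice for the follow-up pairwise comparisons, not something proposed in the thread.

```r
# Hypothetical simulated data standing in for one population.
set.seed(2)
nests <- data.frame(Year = rep(2010:2014, each = 20))
nests$nestDay <- 150 + rnorm(nrow(nests), sd = 10)

# Treat year as a categorical factor: one mean per year, no assumed trend.
nests$YearF <- factor(nests$Year)
fit_aov <- aov(nestDay ~ YearF, data = nests)
print(summary(fit_aov))

# Pairwise differences between specific years (Tukey's HSD, an assumption here).
tukey <- TukeyHSD(fit_aov)
print(tukey)
```

As the comment notes, this only tests whether mean nesting day differs between particular years; it does not estimate a direction or rate of change over time.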
