GLS Function - Fitting and Interpretation Issues

Question

I am well aware, that this is a FAQ, but other questions could not provide me answers to my question. Also, I hope this will not be considered a double post, since I have posted this issue with a different question (see here).

I have data that I fit into a gls model from the nlme package in R. This is what the data looks like

fid|Treatment|NDVI_mean|date      |Management1
1          G 0.7918969 2018-10-11     Control
2          J 0.1899928 2018-10-11    Clearcut
3          H 0.2378630 2018-10-11    Clearcut
4          C 0.3631807 2018-10-11       PR1/2
5          D 0.3294372 2018-10-11       PR1/2
6          E 0.4611494 2018-10-11       PR3/4
7          F 0.3681659 2018-10-11       PR3/4
8          I 0.3020894 2018-10-11    Clearcut
9          B 0.3247002 2018-10-11          NR
10         K 0.3109387 2018-10-11          NR
11         L 0.8071204 2018-10-11     Control
12         G 0.8191044 2018-11-28     Control
13         J 0.1952865 2018-11-28    Clearcut
14         H 0.2209233 2018-11-28    Clearcut
15         C 0.3942993 2018-11-28       PR1/2
16         D 0.3285107 2018-11-28       PR1/2
17         E 0.4192142 2018-11-28       PR3/4
18         F 0.3856550 2018-11-28       PR3/4
19         I 0.2522038 2018-11-28    Clearcut
20         B 0.3300189 2018-11-28          NR
21         K 0.2990551 2018-11-28          NR
22         L 0.8361506 2018-11-28     Control
23         G 0.8274411 2018-12-05     Control
...
479         E 0.5348337 2022-06-15       PR3/4
480         F 0.4959483 2022-06-15       PR3/4
481         I 0.5295775 2022-06-15    Clearcut
482         B 0.4168112 2022-06-15          NR
483         K 0.4540260 2022-06-15          NR
484         L 0.6940871 2022-06-15     Control

The data recorded a NDVI (vegetation index i.e. greenness/vigor) time series over the span of four years. The Treatment are different study sites and Mangement1 different treatments. Here is a plot for a better understanding:

I want to find out, how Management1 is influencing the NDVI_mean values over time. Therefore, I have fit the data in a gls model:

gls_model <- gls(NDVI_mean ~ Management1 + date, correlation = corCAR1(form = ~date | Treatment), data = Data_2018_2022)

Generalized least squares fit by REML
  Model: NDVI_mean ~ Management1 + date 
  Data: Data_2018_2022 
        AIC       BIC  logLik
  -1337.034 -1303.677 676.517
Correlation Structure: Continuous AR(1)
 Formula: ~date | Treatment 
 Parameter estimate(s):
      Phi 
0.9850594
Coefficients:
                         Value  Std.Error    t-value p-value
(Intercept)         -1.7199592 0.28986021  -5.933754       0
Management1NR       -0.4107973 0.02250663 -18.252277       0
Management1PR1/2    -0.4161073 0.02250663 -18.488207       0
Management1PR3/4    -0.3806264 0.02250663 -16.911743       0
Management1Clearcut -0.4365251 0.02054565 -21.246592       0
date                 0.0001345 0.00001567   8.588285       0
Correlation: 
                    (Intr) Mng1NR M1PR1/ M1PR3/ Mngm1C
Management1NR       -0.039

Management1PR1/2    -0.039  0.500

Management1PR3/4    -0.039  0.500  0.500

Management1Clearcut -0.043  0.548  0.548  0.548

date                -0.998  0.000  0.000  0.000  0.000
Standardized residuals:
       Min         Q1        Med         Q3        Max 
-2.7290950 -0.7069599 -0.1141165  0.6488433  3.2727331
Residual standard error: 0.07468751 
Degrees of freedom: 484 total; 478 residual

I find that the estimates look somewhat unreasonable looking at the plots of my data. If I fit the model in the following way: gls_model <- gls(NDVI_mean ~ Management1 + fid, correlation = corCAR1(form = ~ date | Treatment), data = Data_2018_2022) The summary() seems to be more accurate for what I understand:

Generalized least squares fit by REML
  Model: NDVI_mean ~ Management1 + fid 
  Data: Data_2018_2022 
        AIC       BIC   logLik
  -1322.891 -1289.534 669.4456
Correlation Structure: Continuous AR(1)
 Formula: ~date | Treatment 
 Parameter estimate(s):
      Phi 
0.9870169
Coefficients:
                         Value   Std.Error   t-value p-value
(Intercept)          0.6889661 0.021376286  32.23039       0
Management1NR       -0.4110561 0.025436529 -16.16007       0
Management1PR1/2    -0.4148831 0.025436178 -16.31075       0
Management1PR3/4    -0.3794014 0.025436108 -14.91586       0
Management1Clearcut -0.4346909 0.023219982 -18.72055       0
fid                  0.0002791 0.000042286   6.60065       0
Correlation: 
                    (Intr) Mng1NR M1PR1/ M1PR3/ Mngm1C
Management1NR       -0.592

Management1PR1/2    -0.596  0.500

Management1PR3/4    -0.595  0.500  0.500

Management1Clearcut -0.653  0.548  0.548  0.548

fid                 -0.540 -0.006  0.002 -0.001  0.003
Standardized residuals:
        Min          Q1         Med          Q3         Max 
-2.80656156 -0.80840445 -0.07411791  0.62982805  3.34187612
Residual standard error: 0.0794082 
Degrees of freedom: 484 total; 478 residual

Also, if I fit the model gls_model <- gls(NDVI_mean ~ Management1, correlation = corCAR1(form = ~date | Treatment), data = Data_2018_2022), the output of the summary() seem to be valid, however, the standardized residuals vs. fitted values plot has a pattern.

Question 1: Do I interpret the output of the second model correctly by saying, that the (Intercept) value is the mean value of the NDVI_mean in Management1Control when Management1 was equal to 0 and e.g. Management1NR would be equal to the Intercept minus -0.4227368 ?

Question 2: How is it possible, that the model seems to produce a (from my understanding) valid output only if I include the fid in the model ?

EDIT I followed the suggestion of @EdM and fitted the model gls_model <- gls(NDVI_mean ~ Management1 * date, correlation = corCAR1(form = ~date | Treatment), data = Data_2018_2022) The summary() now is

eneralized least squares fit by REML
  Model: NDVI_mean ~ Management1 * date 
  Data: Data_2018_2022 
        AIC       BIC   logLik
  -1289.247 -1239.312 656.6233
Correlation Structure: Continuous AR(1)
 Formula: ~date | Treatment 
 Parameter estimate(s):
      Phi 
0.9801462
Coefficients:
                             Value Std.Error   t-value p-value
(Intercept)               0.598611 0.5475773  1.093199  0.2749
Management1NR            -2.691958 0.7743913 -3.476225  0.0006
Management1PR1/2         -2.344346 0.7743913 -3.027341  0.0026
Management1PR3/4         -2.427570 0.7743913 -3.134811  0.0018
Management1Clearcut      -4.913910 0.7069193 -6.951162  0.0000
date                      0.000009 0.0000296  0.306608  0.7593
Management1NR:date        0.000123 0.0000419  2.944852  0.0034
Management1PR1/2:date     0.000104 0.0000419  2.488846  0.0132
Management1PR3/4:date     0.000111 0.0000419  2.640983  0.0085
Management1Clearcut:date  0.000242 0.0000383  6.331952  0.0000
Correlation: 
                         (Intr) Mng1NR Mn1PR1/2 Mn1PR3/4 Mngm1C date   Mn1NR: M1PR1/2: M1PR3/4:
Management1NR            -0.707

Management1PR1/2         -0.707  0.500

Management1PR3/4         -0.707  0.500  0.500

Management1Clearcut      -0.775  0.548  0.548    0.548

date                     -1.000  0.707  0.707    0.707    0.774

Management1NR:date        0.707 -1.000 -0.500   -0.500   -0.548 -0.707

Management1PR1/2:date     0.707 -0.500 -1.000   -0.500   -0.548 -0.707  0.500

Management1PR3/4:date     0.707 -0.500 -0.500   -1.000   -0.548 -0.707  0.500  0.500

Management1Clearcut:date  0.774 -0.548 -0.548   -0.548   -1.000 -0.775  0.548  0.548    0.548
Standardized residuals:
        Min          Q1         Med          Q3         Max 
-2.44410198 -0.65724392 -0.07158163  0.69272402  2.54105053
Residual standard error: 0.06669597 
Degrees of freedom: 484 total; 474 residual

None of these models is valid. None includes the most important repeated measure here, the multiple measurements within each level of the Treatment variable. Treatment, not date, should be after the "|" symbol in your formulas. To handle autocorrelations over time, you might include something like ~ date|Treatment, with date kept as a fixed predictor. For the best ways to formulategls models, see Chapter 7 of Frank Harrell's Regression Modeling Strategies. — EdM, Feb 29 '24 at 15:49
The additive term for date doesn't allow for the slope with respect to date to differ among Management1 values, yet the plots show very different slopes. You need a Management1*date interaction term for that. The fid numbers are just arbitrary indices of the order of measurements and shouldn't be used at all. — EdM, Feb 29 '24 at 16:56
Thanks again for your suggestion. That makes a lot of sense. However, the result seems not to be valid, if I am interpreting correctly (as I tried to decsribe above). — Vik123, Feb 29 '24 at 17:46

EdM · Accepted Answer · 2024-02-29T20:42:18.343

The final model

gls(NDVI_mean ~ Management1 * date, correlation = corCAR1(form = ~date | Treatment), data = Data_2018_2022)

seems to be OK and to do a reasonable job of fitting your data, although it only allows for linear changes in outcome over time.

The coefficient values seem weird because of how R encodes your date values. It uses the "number of days since 1970-01-01." Thus your earliest date of 2018-10-11 is already at a date value of 17185. The (Intercept) and the individual coefficients for the Management1 levels, however, are the estimates at date = 0, an extrapolation to times that are long before your study.

That said, if you use the model to get estimates corresponding to your actual dates, all seems pretty much OK.

For example, the Control values are reasonably constant over time, at a value of about 0.76. The low coefficient for date (representing the slope specifically for Control) is consistent with that. If you start with an (Intercept) of about 0.60 (the estimated value for Control at date = 0) and add in a slope of 0.000009 over the 17185 days since then, you get

0.60 + 17815*0.000009
## [1] 0.760335

That reconstructed the Control starting value OK.

The interaction coefficients between date and each of the other levels of Management1 represent the differences from Control in slope. Consistent with your plots, each of those interaction terms is positive, with the apparent order in the slopes of your plots consistent with the coefficient values.

The single coefficients for each of the Management1levels also seem to be OK, once you take into account all the coefficients involved. I checked this in particular for the NR level. To get the estimate for NR at the start of your study, you need to start with the (Intercept) (0.60), add in the coefficient for NR (-2.69, the difference between NR and Control at date = 0), and the estimated linear change over the intervening 17815 days (17815 times the sum of the date and Management1NR:date coefficients):

0.60 -2.69 + 17815*(0.000009 + 0.000123)
## [1] 0.26158

That agrees quite well with the starting value for NR in your plot.

The increase in that value for NR over the 1343 days of your study would be:

1343 * (0.000009 + 0.000123)
## [1] 0.177276

again agreeing well with your data.

The reason that you thought that you might have had a useful model when your used the fid values as a predictor and omitted date as a fixed predictor is that the fid values are very highly correlated with the date values, in the particular way that you set up your data. When using some arbitrary set of predictor values seemingly unrelated to what you thought you were studying seems to "work," take that as a sign that you need to re-evaluate what you are doing.

Thank you so much for taking the time for this incredible explanation. I have added a variable with days passed to make things easier, and now everything makes sense. Thank you for the clarification. — Vik123, Mar 01 '24 at 13:49

GLS Function - Fitting and Interpretation Issues

1 Answers1