
I'm trying to fit a GLM to my dataset, which consists of soil respiration (RS), soil temperature (TEMP), soil water content (SWC), biomass (BIOM), the day of the year when the sampling was done (DOY) and the vegetation type (grasslands, old fields, ploughland and oversewn grassland). The measurements were taken along a 15 m long circular transect of consecutive quadrats, one every 20 cm, so there are 75 measurements per transect.


The question is the relationship between soil respiration (RS) and the other variables (SWC, TEMP, BIOM, DOY and type of vegetation), that is, how changes in these variables influence soil respiration (e.g. if the temperature increases, will soil respiration also increase?). I am thinking about a model like this: glm(RS~SWC+TEMP+DOY+type).

The values of RS, TEMP and DOY are all above zero, but SWC and BIOM have zero values, and there are NAs in the BIOM variable. None of the variables are normally distributed and there is an order of magnitude difference between the variables.


How can I decide which family to use?

Thank you for the suggestions!

Edit: boxplot and histogram of the variables


Related question: Do I need to transform my variables for GLM?

  • What is your research question? – user2974951 Nov 24 '22 at 10:47
  • The question is the relationship between soil respiration (RS) and the other variables (SWC, TEMP, BIOM, DOY and type of vegetation) so how the changes of the related variables influence soil respiration (e.g. if the temperature is increasing, will soil respiration also increase?). I am thinking about a model like this: glm(RS~SWC+TEMP+DOY+type) – Zita Zimmermann Nov 24 '22 at 10:57
  • OK, so what does RS look like? What is its domain and distribution? Can you show some plots? – user2974951 Nov 24 '22 at 10:59
  • It may well be that a straightforward OLS is quite sufficient, even if your RS is constrained to be nonnegative. As @user2974951 writes, it would be good to see a few plots. Things that come to my mind: (1) You might want to use trigonometric transforms of DOY to tell your model that DOY=1 is similar to DOY=365. (2) I don't know what a transect is, but it sounds like you have repeated measurements, so a mixed model might be appropriate. (3) 29% NAs in BIOM is concerning. Have you looked at whether there are patterns? – Stephan Kolassa Nov 24 '22 at 11:15
  • @user2974951: I added a boxplot and a histogram of the RS – Zita Zimmermann Nov 24 '22 at 11:31
  • @Stephan Kolassa: (2) Yes, a transect in this case means that the measurements were taken along a 15 m long circular transect of consecutive quadrats, one every 20 cm, which gives the 75 measurements. (3) There is no pattern in the missing biomass values; it simply means that we did not sample biomass on all transects. – Zita Zimmermann Nov 24 '22 at 11:33
  • The choice of family is much less crucial than the choice of link which here surely starts with trying logarithms. On the face of it your measurements are for a Northern Hemisphere summer or a Southern Hemisphere winter, but I agree with @StephanKolassa that using sines and cosines of something like (doy - 0.5) / (365 or 366) is natural to tackle seasonality. As a geographer I do know what a transect is; your present set-up implies that you are treating different sites on your transect as independent replicates, and whether that is a sound idea is an open question. – Nick Cox Nov 24 '22 at 11:52
  • The distributions of the predictors (covariates) have no bearing on choice of GLM family. Whether they need transformation on other grounds is a different question. – Nick Cox Nov 24 '22 at 11:54
  • @NickCox: Thank you for the answer, I will try to transform the DOY variable. I treat the different transects as independent replicates. I did the sampling at 10 different sites across the country (Hungary), and the transects in the dataset I show here were made at one of the sites, but randomly distributed within the site. – Zita Zimmermann Nov 24 '22 at 12:26
  • You are going to need a family of sines and cosines. https://journals.sagepub.com/doi/pdf/10.1177/1536867X0600600408 includes tutorial material. This stuff is trivial for experienced statisticians (I am not a statistician myself) but in almost no introductory texts or courses. – Nick Cox Nov 24 '22 at 12:34
  • @NickCox: Thank you! I'm very far from an experienced statistician, but I will try my best to understand it. – Zita Zimmermann Nov 24 '22 at 12:53
  • Please add new information in comments as an edit to the post! Especially the research question should be prominent at the beginning of the post! We want posts to be self-contained; comments are easily overlooked (especially when there are so many as with this post), and they can be deleted. – kjetil b halvorsen Nov 24 '22 at 13:16
  • Also, please make a better (more informative) title, somehow mentioning soil ... – kjetil b halvorsen Nov 24 '22 at 15:22
  • On the face of it, it appears that Gamma regression would be a likely choice. The Gamma distribution can have 0 or positive, continuous, values, and can be right skewed. (en.wikipedia.org/wiki/Gamma_distribution). It's a family built in to the glm() function in the native stats package in R. ... That being said, depending on your audience, you might try a simple log transformation of RS instead. – Sal Mangiafico Nov 24 '22 at 17:29
  • @SalMangiafico As far as I know, Gamma is defined on $(0,\infty)$, so no zeros. – user2974951 Nov 25 '22 at 06:32
  • @SalMangiafico: Thank you, but, as user2974951 wrote, Gamma is not appropriate if you have zeros in the dataset. – Zita Zimmermann Nov 25 '22 at 12:29
  • If you want to go down the rabbit hole How to model non-negative zero-inflated continuous data? might have some useful tips. – user2974951 Nov 25 '22 at 13:02
  • Thanks, @user2974951 , yes, thanks for the correction, Gamma isn't defined for zeros. On that topic, also log transformation can't handle zeros. The zeros for SWC and BIOM might be adjusted to small values, as, at least for SWC, a zero value is unlikely. You might use a method like substituting half the nominal detection limit for a zero value. – Sal Mangiafico Nov 25 '22 at 13:33
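
The sine and cosine idea for DOY raised in the comments can be sketched in R as below. This is only a minimal illustration: the `doy` values are made up, a 365-day year is assumed, and the names `angle`, `doy_sin` and `doy_cos` are hypothetical.

```r
# Hypothetical sampling days; the real DOY column would be used instead.
doy <- c(1, 91, 182, 274, 365)

# Map day of year onto one full cycle, following the (doy - 0.5) / 365 idea.
# A 365-day year is assumed here.
angle <- 2 * pi * (doy - 0.5) / 365

# First-order harmonic terms.
doy_sin <- sin(angle)
doy_cos <- cos(angle)

# Day 1 and day 365 now sit next to each other on the circle.
round(cbind(doy, doy_sin, doy_cos), 3)

# These terms could then replace raw DOY in the model formula, e.g.
# RS ~ SWC + TEMP + doy_sin + doy_cos + type
```

Whether one pair of harmonics is enough depends on how much of the year the sampling covers; with data from a single season the seasonal terms may be poorly determined.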

1 Answer


Your dependent variable RS could likely be handled with a log transformation or modeled as conditionally Gamma-distributed.
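
As a rough sketch of these two options, the R code below fits a Gamma GLM with a log link and an OLS model on log(RS) to simulated data. The data frame `dat`, its coefficients and the reduced formula are assumptions made for illustration, not the poster's data or final model.

```r
set.seed(1)

# Simulated stand-in data; not the poster's measurements.
n <- 200
dat <- data.frame(
  TEMP = runif(n, 10, 30),
  SWC  = runif(n, 5, 40)
)
mu <- exp(0.2 + 0.05 * dat$TEMP + 0.01 * dat$SWC)
# Strictly positive, right-skewed response with mean mu.
dat$RS <- rgamma(n, shape = 5, rate = 5 / mu)

# Option 1: Gamma GLM with a log link.
fit_gamma <- glm(RS ~ TEMP + SWC, family = Gamma(link = "log"), data = dat)

# Option 2: OLS on log-transformed RS.
fit_loglm <- lm(log(RS) ~ TEMP + SWC, data = dat)

# Both sets of slopes are on the log scale, so they are roughly comparable.
coef(fit_gamma)
coef(fit_loglm)
```

The two fits model slightly different things (the log of the mean versus the mean of the log), so the intercepts in particular need not agree.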

In this case, there would be a question of how to handle zero values. One thing to consider is whether these values are actually zero, or simply below some nominal detection level. I imagine that soil respiration in a natural system would be unlikely to be precisely zero, but there could be situations where it is. One approach for left-censored data is to simply substitute small values for zero values. It's clear that the zeros account for a relatively small proportion of your observations. You might see section 4.7 in the USEPA document below for some simple guidelines for substituting values for observations below the detection limit.
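
A minimal sketch of such a substitution in R is shown below. The vector `x` and the value of `detection_limit` are made up for illustration; the real nominal limit would come from the instrument or method used.

```r
# Hypothetical values and detection limit, for illustration only.
x <- c(0, 0.8, 1.5, 0, 2.3, NA)
detection_limit <- 0.1   # assumed value, not taken from the data

# Replace exact zeros with half the detection limit; NAs are left untouched.
x_adj <- ifelse(!is.na(x) & x == 0, detection_limit / 2, x)

x_adj
```

After the substitution all non-missing values are strictly positive, so a log transformation or a Gamma family becomes usable, at the cost of treating the censored values as known.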

Also, ordinary least squares (OLS) regression may work for your situation. You might construct the model and examine the residuals. It may be that RS is close enough to conditionally normal for this to work fine.
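
Constructing the model and examining the residuals might look like the sketch below. The data are simulated and the formula is cut down to two predictors, so `dat` and `fit_ols` are illustrative names rather than the poster's objects.

```r
set.seed(2)

# Simulated stand-in data; not the poster's measurements.
n <- 150
dat <- data.frame(TEMP = runif(n, 10, 30), SWC = runif(n, 5, 40))
dat$RS <- 1 + 0.2 * dat$TEMP + 0.05 * dat$SWC + rnorm(n, sd = 0.5)

# Ordinary least squares fit.
fit_ols <- lm(RS ~ TEMP + SWC, data = dat)

# Residuals vs fitted values and a normal Q-Q plot, side by side.
op <- par(mfrow = c(1, 2))
plot(fit_ols, which = 1:2)
par(op)
```

A funnel shape in the first plot or strong curvature in the Q-Q plot would suggest that a log transformation or a Gamma model is worth trying instead.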

Whether to transform the other variables is a separate question.

USEPA. 2000. Guidance for Data Quality Assessment: Practical Methods for Data Analysis, EPA QA/G-9, QA00 Update. https://www.epa.gov/sites/default/files/2015-06/documents/g9-final.pdf

Sal Mangiafico