Association of soil toxic element with epigenetic aging in four different cities

Question

I would like to do regression analysis (linear/non-linear) using gam models for soil toxic elements and epigenetic age acceleration; here, we have measured soil elements (mg/kg) using ICP-MS in four different cities.

I have used the below code with the interaction (cities) term. Still, it has been advised that this is not the correct way for modelling interaction term so now I am trying to use gam model with interaction term for linear and non-linear models. But I'm not sure how to do that. I have added gam model below for linear and non-liner without interaction term using gam, but could you please help me how to improve this for interaction term (cities) and also how to mention covariates like Sex, Age, and Smoking_Status in gam models.

raw <- read.table("clean.raw.elements.txt", header=TRUE, sep="\t")
colnames(raw)
model1 <- lm (EpigeneticAgeAcceleration_EEAA ~ 0 + cities + 
    cities:( Zn + Cu + Hg + Mo + Pb + Smoking_Status + Sex + Age), 
    data = raw)
summary_model1 <- summary(model1)
summary_model1
Coefficient estimates
coefficients <- coef(summary_model1)
Confidence intervals
conf_intervals <- confint(model1)
Combine coefficients and confidence intervals
combined_output <- cbind(coefficients, conf_intervals)
combined_output
write.table(combined_output, "PCGrimAge_EEAA_elements.txt", sep=",")

gam models without interaction term; cities

> linear_model <- gam(EpigeneticAgeAcceleration_EEAA ~ Hg, data=ph1)
> smooth_model <- gam(EpigeneticAgeAcceleration_EEAA ~ s(Hg), 
                      data=ph1)
> AIC(linear_model, smooth_model)
                   df      AIC
linear_model 3.000000 9721.443
smooth_model 3.314677 9721.311

I have also tried this, but I am not sure how to get effect size and other coefficients for all four cities;

m.factor <- gam(PCGrimAge_EEAA ~ Hg + te(Hg, by=cities), data=ph1)
> 
> summary(m.factor)
Family: gaussian 
Link function: identity
Formula:
PCGrimAge_EEAA ~ Hg + te(Hg, by = cities)
Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)

(Intercept)     1.4009     0.6086   2.302   0.0215 *
           Hg  -1.6693     0.6942  -2.405   0.0163 *

Signif. codes:  0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
                             edf Ref.df     F p-value

te(Hg):sant              0.7654 0.7552 4.229  0.0741 .
te(Hg):baflea            0.7552 0.7552 0.001  0.9813

te(Hg):panipa            0.7552 0.7552 6.989  0.0217 *
te(Hg):kurus             3.3627 3.6542 1.385  0.3273

Signif. codes:  0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Rank: 17/18
R-sq.(adj) =  0.00168   Deviance explained = 0.532%
GCV = 16.761  Scale est. = 16.691    n = 1813

How many epigenetic_age_acceleration observations are there? Is there just 1 measurement of each element for each of 4 cities? If so, the quartiles for each element will be given simply by the rank orders of that element among the cities. A table showing the values of the elements for each of the cities would be helpful in that case. Please provide that information by editing the question, as comments are easy to overlook and can be deleted. Also, if your related question is no longer needed, please help keep this site tidy by deleting it. — EdM, Feb 14 '24 at 15:17
Thanks, there are around 1500 individual for which we have calculated age acceleration by regression the chronological age. these participant belongs to four cities and soil samples are collected from individual residence area from four cities to measure the toxic elements. so now looking for association between toxic cement and age acceleration. — bioinfonext, Feb 14 '24 at 17:13
Thanks again @EDM, I have fitted qubic spline above but not sure should I use df or knot here. If I need to use knot then how to find value what I need to put for each element as each element as different raw concentration(mg/kg) in soil. And also how to plot the spines?I have also updated the emmeans code. — bioinfonext, Feb 17 '24 at 15:39
also let you know that relationship for some of the elements with epigenetic age acceleration might not be linear. at low concentration they might be essential for body so might decelerate epigenetic age and at higher concentration might be toxic to body so may cause epigenetic age acceleration. — bioinfonext, Feb 17 '24 at 17:57
Thanks, I also want city effect along with toxic elements (city*toxic elements interaction). I have coded sex as 0=female, male=2 and same with smoking as number. — bioinfonext, Feb 18 '24 at 17:52
I am using sex, age, smoking as coriates so not sure if need to use them while fitting natural splines. — bioinfonext, Feb 18 '24 at 18:00
Thanks @EdM, updated code now, I am using age, sex, smoking as covariates now. — bioinfonext, Feb 18 '24 at 18:28
I was trying to use gam model but getting error , could you please have look? Many thanks for your time and all help. — bioinfonext, Feb 19 '24 at 13:07
It would help if you would edit the question to include in it the information about the numbers of cities and observations, and to clarify whether "city" and "buffer zone" mean the same thing and, if not, how they are related. It also would help if you could show summary_model1 and quality-control plots for model1, which might be working well enough for this particular set of data and model. This page discusses how to think about and include interactions in GAMs if that's what you need; also see pages linked from there. — EdM, Mar 03 '24 at 16:17
Thank you so much EdM. Edited now. Could please suggest how to plot quality control plot for model1. — bioinfonext, Mar 04 '24 at 06:01
Your gam() model is incorrect; you need to include cities as a separate term, not just in the te() smooth. See this answer. I'm also not sure that you should include Hg outside the smooth. A simple plot(model1) is a standard initial assessment of a lm() fit; see this page. Other more specific plots are outlined on this page. — EdM, Mar 04 '24 at 15:31
I am not able to understand EdM. How to do linear and non linear gam code with interaction. Please help! — bioinfonext, Mar 05 '24 at 06:35
Not sure should we use s or te ? And how to use these with interaction. — bioinfonext, Mar 05 '24 at 06:43

EdM · Answer 1 · 2024-03-07T17:39:26.200

Substantive issues

Before getting into details of the choice of model type and coding, you need to make some decisions about what you want to model. As I understand the situation, you have epigenetic_age_acceleration (EAA)values for about 1700 individuals across 4 cities. Each individual has values for Smoking_Status, Sex and Age, and measurements of each of 5 heavy metal concentrations taken from near their residences. You want to model EAA as a function of those predictors.

First, both your ordinary least squares (OLS) model and your (incorrectly formulated) generalized additive model (GAM) suggest that you want to include interactions of each of the other predictors with the categorical predictor cities. It's not particularly clear what you hope to accomplish by that. You will not get a generalizable answer about the overall association of the other predictors with EAA. Rather, you will effectively get a set of 4 separate models, one for each value of cities. That might describe your data adequately but it's not clear how you would even think about extending those results to new situations.

To this non-expert, it seems that your model would be more useful if you used cities as a categorical predictor to allow for systematic overall differences in EAA among cities, but not to include interactions of cities with the other predictors unless you have strong theoretical reasons to expect that the value of cities per se will affect the associations of other predictors with EAA.

Second, both of your proposed models do not include any interactions among the predictors other than cities. They thus do not allow for combinations of heavy metals to have different associations with EAA than the sums of their individual associations, or for the effects of the heavy metals to depend on Age or Sex or Smoking_Status. That seems to be very limiting.

Before you continue, I recommend careful study of the principles of regression modeling, for example in Frank Harrell's Regression Modeling Strategies (RMS). Chapter 4 seems directly relevant to decisions about the complexity of the model that you might be able to fit with 1700 observations, and Chapter 2 discusses strategies for flexible modeling of your continuous predictors (heavy metal concentrations and Age). Some choices need to be made based on your understanding of the subject matter.

OLS or GAM

Once you have decided what you want to model, both OLS and GAM can allow for flexibly fitting nonlinear functions of continuous predictors. Chapter 7 of An Introduction to Statistical Learning (ISLR) outlines the approaches. Regression splines most naturally fit into an unpenalized OLS model. They can be thought of as a type of GAM, but a GAM is typically thought of as a more general approach allowing for several types of flexible modeling, with smoothness of the fits determined by penalization. Although ISLR emphasizes separate modeling of each continuous predictor in a GAM, the gam() function in the R mgcv package can accept smoothing terms that effectively incorporate interactions among predictors.

OLS with regression splines can work quite well. Frank Harrell notes:

As an aside, I have found little advantage of GAM's over parametric additive regression spline models, which give simpler formal tests and confidence intervals, and provide formulas for prediction.

OLS

Consider an extension of the following type of unpenalized OLS model. I recommend the rms package for this type of work, as its rcs() spline implementation has better default knot placement in my opinion than splines::ns() and (once you learn how to use it) the package provides a coherent framework for analysis, presentation and prediction.

library(rms)
olsModel <- ols(EAA ~ cities + Sex + rcs(Age) + Smoking_Status + rcs(Zn) + rcs(Cu) + rcs(Hg) + rcs(Mo) + rcs(Pb))

The default rcs() terms use 4 knots (3 coefficients to estimate) for each continuous predictor. If Smoking_Status is binary, then this model only requires estimating (beyond the intercept) 3 coefficients for cities, 1 for Sex, 3 for Age, 1 for Smoking_Status and 15 for all 5 heavy metals (3 each). In practice, you might want to consider working with log transformations of the heavy metal concentrations instead of their raw values, but still using splines for fitting them.

The above is the type of "saturated model" that Harrell recommends as a starting point in Section 4.1.1 of RMS.

With 1700 observations you could consider a much richer OLS model with interactions; you might envision a model with up to about 100 coefficients without overfitting. For example, including an interaction of Sex with each of the spline-fit heavy metals would only add 15 more predictors; the same for Smoking_Status if that's binary.

You seem to have enough data to fit a full interaction between cities and splines of the heavy-metal concentrations. For example, if you want those but you don't want interactions involving Sex, Age, or Smoking_Status you could write:

olsCityInteraction <- ols(EAA ~  Sex + rcs(Age) + Smoking_Status + cities*(rcs(Zn) + rcs(Cu) + rcs(Hg) + rcs(Mo) + rcs(Pb)))

That would involve 1 coefficient for Sex, 2 for Smoking_Status (3-level categorical), 3 "main effect" coefficients for the 4 cities, and 60 for the 5 splines and their interactions with cities (3 coefficients for each heavy metal for each city).

In general, I find it safest to use "*" for interactions instead of trying to specify particular interactions with ":" as you did. That ensures that all lower-level terms are also included appropriately. The anova() function for rms objects (like those produced by ols()) provides a very helpful summary of overall significance of each predictor, including its nonlinear terms and interactions.

Sometimes interactions aren't needed beyond the linear components of the splines (specified by "%ia%" instead of "*" in rms syntax), substantially cutting down on the number of coefficients. Restricting to linear interactions could allow you to evaluate interactions among the heavy metals. Or, if you really want to do so despite my misgivings, to include interactions of cities with the heavy metals.

I very strongly recommend that you learn to use the rms package if you are going to be doing regression analysis. It has a bit of an initial learning curve, but once you get beyond that it makes work much easier.

If you want to stick with basic R functions, in the above examples you could just use lm() instead of ols() and use ns() instead of rcs() for the splines (after loading the built-in splines package).

GAM

The main potential advantage that I see with a GAM of your data instead of OLS is if you want to allow for nonlinear interactions among the heavy metals. All of the simpler interactions of heavy metals with the other predictors, or even linear interactions among them, can be handled by OLS as noted above.

If that's what you want, heed this warning from Wayne:

I'd emphasize that GAMs are much more flexible than GLMs, and hence need more care in their use. With greater power comes greater responsibility... You don't have to understand exactly how GAMs work to use them, but you really need to think about your data, the problem at hand, your software's automated selection of parameters like smoother orders, your choices (what smoothers you specify, interactions, if a smoother is justified, etc), and the plausibility of your results.

If you want to use GAMs you have to devote some serious effort. It's not just a matter of how to code them; it's a matter of understanding enough about them to make intelligent choices.

Happily, there are resources to help with that. The help page for gam.models in mgcv deserves careful study. As best as I can tell, it seems to contain answers to all your questions.

For example, a model like this:

y ~ x1 + x2 + s(x3) + s(x4,x5)

treats x1 and x2 as unpenalized, uses a single s() smooth for x3, and a two-dimensional s() smooth combining x4 and x5, using "the default basis for the smooths (a thin plate regression spline basis for each), with automatic selection of the effective degrees of freedom for both smooths." That said, "The above assumes that x4 and x5 are naturally on similar scales... If this assumption is false then tensor product smoothing might be better (see te)."

If you want a "main effect plus interaction" structure:

Such models should be set up using ti terms in the model formula. For example:

y ~ ti(x) + ti(z) + ti(x,z)

For interactions of continuous and categorical predictors:

by variables are the means for... letting smooths ‘interact’ with factors or parametric terms... Note that when using factor by variables, centering constraints are applied to the smooths, which usually means that the by variable should be included as a parametric term, as well.

That's where your GAM interaction model was in error: you did not include cities as a parametric term (outside the smooth). I'm also not sure how including Hg as both a smoothed and as a parametric predictor would work.

If you need to go beyond that, there are nearly 1000 questions on this site tagged generalized-additive-model. The over 100 frequently viewed pages probably contain answers to other questions that you might have. In particular, Gavin Simpson is an expert in GAM who has contributed many helpful answers.

Finally, "Generalized Additive Models" by Simon Wood provides a good combination of theoretical and practical material.

thank you so much EdM, I just want interaction and main effect of cities and heavy metals. Other variable I want to use as covariates in the model; Age, sex (0 female, 1 male, Smoking_Status (1=never, 2=current, 3= past smoker). — bioinfonext, Mar 06 '24 at 06:59
thanks EdM, for all your help and time. I am not able to understand about rms code (unpenalized OLS model) above. how to mention data file in this code and also Age, Sex, and smoking we are using as covariates, so do we need to mention spline for these but not sure how? I coded smoking as (1 =never, 2, current smoker, 3 past smoker, sex as male=1, female=0 and Age as actual age from birth) — bioinfonext, Mar 07 '24 at 14:36
@bioinfonext you need to study more on how to set up regression models in R and how the modeling works. The setup is basically the same for many functions. For example, you typically include something like data=myData in a function call to let the function know to look for the values in a data frame called myData. Read the help pages. Even if you get the code that you think you want via questions on this site, at some point you are going to have to defend your work to others. If you don't also understand the basics of what the functions are doing, you won't be able to defend your work. — EdM, Mar 07 '24 at 17:46

Association of soil toxic element with epigenetic aging in four different cities

Coefficient estimates

Confidence intervals

Combine coefficients and confidence intervals

1 Answers1