
Background:

We have biological gene expression data of cells (from a single-cell experiment). The data take the form of a gene-by-cell matrix, where each value is the expression of a specific gene in a specific cell. I am seeking to identify genes that "go together" with another (continuous) property of the cells (a methylation score for each cell).

The first choice of test (an obvious and simple one) was the Spearman correlation test. It can be run for each gene separately, and the relevant genes can then be identified.
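A minimal sketch of that per-gene test (`expr` and `methylation` are illustrative names for the gene-by-cell matrix and the per-cell scores, not from my actual pipeline):

## expr: gene-by-cell matrix; methylation: one score per cell (illustrative names)
spearman_p <- apply(expr, 1, function(g)
  cor.test(g, methylation, method = "spearman")$p.value)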

However, it was found that there is a technical variable (a measure of cell quality) that is correlated with both of them: with the methylation score and with gene expression (at least for some of the genes of interest).

The obvious second choice of test seemed to be a multiple regression, with the model expressed as: methylation_score ~ gene_expression + cell_quality

If I understand correctly, it should give precisely the influence of gene expression on the score of interest, controlling for cell quality.
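In R, that per-gene model might look like this (again a minimal sketch; `expr`, `methylation`, and `cell_quality` are illustrative names):

## One regression per gene, reporting the p-value for the expression coefficient
lm_p <- apply(expr, 1, function(gene_expression) {
  fit <- lm(methylation ~ gene_expression + cell_quality)
  summary(fit)$coefficients["gene_expression", "Pr(>|t|)"]
})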

Problem

I have no reason to assume that the relationship between gene expression and the score of interest (methylation) is necessarily linear...

Question

Is there a widely-used statistical method that establishes the relationship between two variables when that relationship is not linear - and which controls for a third variable?

Sam
  • What do you mean by multi+linear regression? If that is simply multiple regression (that is, linear regression with more than one predictor), please edit. Otherwise, maybe look into splines, and generalized additive models. Search this site – kjetil b halvorsen Dec 12 '23 at 14:16
  • What exactly is "cell quality"? I'm not sure this variable belongs in modeling. – AdamO Dec 12 '23 at 14:17
  • @kjetilbhalvorsen Edited. – Sam Dec 12 '23 at 14:28
  • I would like more information on the measure of "cell quality": does it indicate, by any chance, how reliable the gene expression data is? Is it continuous? – CaroZ Dec 12 '23 at 15:54
  • @AdamO The "cell quality" is the total number of genes that are expressed in that specific cell. If the total number of genes per cell is very low, it indicates that the quality of the cell is not good. Technically, it is the nGene parameter. Why would that not belong in modeling? – Sam Dec 14 '23 at 16:30

2 Answers


Estimating and Controlling Nonlinear Trends

I realize you have already accepted an answer, but I wanted to address what I perceive as unanswered parts of your question. First, regarding this part:

Is there a widely-used statistical method that establishes the relationship between two variables when that relationship is not linear - and which controls for a third variable?

The most obvious way to model the linear relationship between two predictors, $x_1$ and $x_2$, and a dependent variable $y$ is what you already said: a multiple linear regression. This is because it models the conditional mean of $y$, with each coefficient estimated after partialing out the other effects, as in the following equation:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon $$

We can do this because each term enters additively: multiplying one estimated coefficient by zero leaves the rest of the equation intact to predict the outcome. However, your question also asks about the nonlinear case:

Is there a widely-used statistical method that establishes the relationship between two variables when that relationship is not linear - and which controls for a third variable?

The most obvious way of modeling a nonlinear multiple regression is some kind of polynomial or spline-based method for approximating the relationship between predictor and response. As an example, we can fit the terms $\beta_1 x_1$ and $\beta_2 x_1^2$ as a quadratic trend in our regression. Alternatively, we can construct a piecewise function $f(x)$ for a given predictor out of basis functions (such as a combination of linear and polynomial terms), where $b_j$ is a given basis function:

$$ f(x) = \sum\limits_{j=1}^{k}b_j(x)\beta_j $$

This gives a much more flexible approximation of the effects of $x_1$ and $x_2$ than a simple polynomial would, because we can now "mix and match" basis functions until we get a smooth function that fits the data well. Moreover, this is not limited to the simple bivariate case: it extends to a multiple regression context, where the piecewise functions enter the regression as:

$$ y = f_1(x_1) + f_2(x_2) + \epsilon $$

Here we achieve the same thing as in the linear case, only now each predictor enters through a nonlinear function rather than a linear term (though the model is still technically linear in its parameters). This is the basis of a generalized additive model (GAM), which models nonlinear relationships directly, in much the same way we would with multiple regression. Of course, as alluded to earlier, one could instead use a multiple regression with polynomial or related terms, such as the model below with quadratic terms for both predictors:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 x_2^2 + \epsilon $$
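As a minimal sketch, this quadratic model could be fit with lm() (here `your_data` is a hypothetical data frame with columns x1, x2, and y):

## I() protects the squared terms inside the formula
fit_quad <- lm(y ~ x1 + I(x1^2) + x2 + I(x2^2), data = your_data)
summary(fit_quad)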

However, simple polynomials often interpolate poorly as the degree increases or when data are sparse at the ends of the distribution, and even simple splines can overfit noise if one is not careful with the degrees of freedom employed. The magic of GAMs is that they fit a penalized version of this nonlinear regression to circumvent such problems: the residual sum of squares is augmented with a second-difference roughness penalty weighted by a smoothing parameter $\lambda$:

$$ \Vert{y-X\beta}\Vert^2 + \lambda \sum_{j=2}^{k-1}\{ f(x^*_{j-1})-2f(x^*_j) + f(x^*_{j+1}) \}^2 $$

Thus, in some respects, we get a "safer" nonlinear regression than an unpenalized polynomial or spline fit.

Simulated Example

To demonstrate, we can simulate data in which $y$ depends on two predictors in a nonlinear way, using the mgcv package in R.

#### Simulate and Plot Data ####
set.seed(123)
x1 <- runif(100, 0, 1)                            # first predictor
x2 <- runif(100, 0, 1)                            # second predictor
y  <- log(x1) + log10(x2) + rnorm(100, sd = .01)  # two different log relationships plus noise
df <- data.frame(x1, x2, y)
pairs(df)                                         # inspect the pairwise relationships

The plotted data show that $x_1$ and $x_2$ each have a different log-shaped relationship with $y$ and are uncorrelated with each other:

[pairs plot: x1 and x2 each show a log-shaped relationship with y, and no correlation with each other]

Then we can simply fit the model:

#### Fit GAM ####
library(mgcv)
fit <- gam(
  y ~ s(x1) + s(x2),  # one penalized smooth per predictor
  data = df,
  method = "REML"     # REML gives stable smoothing parameter estimates
)

#### Summarize and Plot ####
summary(fit)
plot(fit, pages = 1)

Summarizing the model, we get the estimated nonlinearity of each smooth (the effective degrees of freedom), the model $R^2$, and other important metrics:

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x1) + s(x2)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.46620    0.01149  -127.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
        edf Ref.df     F p-value    
s(x1) 8.272  8.850 607.4  <2e-16 ***
s(x2) 8.375  8.885 222.6  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.988   Deviance explained = 99%
-REML = -34.978  Scale est. = 0.013201  n = 100

We can also simply plot the estimated model and see if it matches our data:

[plot(fit, pages = 1) output: the estimated smooths s(x1) and s(x2), matching the simulated log shapes]

And it appears to do just that. One can use different smoothing penalties, spline bases, knots, and any number of other adjustments to the regression to make the fit "better." Thus GAMs are a flexible framework that can be applied to your problem.
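As a brief hedged sketch of such adjustments, and of how the model might map onto your setting (`methylation`, `gene_expression`, and `cell_quality` are placeholders for your per-cell vectors, not tested on your data):

## A different spline basis (cubic regression spline) and a larger basis dimension
fit2 <- gam(y ~ s(x1, bs = "cr", k = 20) + s(x2), data = df, method = "REML")

## Hypothetical application to the original question, fit once per gene
fit_gene <- gam(methylation ~ s(gene_expression) + s(cell_quality),
                method = "REML")
summary(fit_gene)  # s(gene_expression) is the quality-adjusted expression effect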


Your question is how to extend inference on the Spearman correlation to a multivariable setting. Note that Spearman does not test for nonlinear trends; it tests for monotonic trends, of which a linear trend is one type. One aspect of a monotonic trend is that the average trend across the span of the covariate has a net positive or net negative direction. For instance, a curvilinear, sigmoidal positive trend would still be characterized by an average increase per unit difference in the covariate.
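A quick illustrative sketch of this point with simulated data: a sigmoidal trend is nonlinear but monotonic, and Spearman detects it.

set.seed(1)
x <- runif(200)
y <- plogis(10 * (x - 0.5)) + rnorm(200, sd = 0.1)  # sigmoidal, monotone trend
cor.test(x, y, method = "spearman")                 # rho is large and significant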

Therefore a sound and completely agnostic approach to estimating this effect is multiple linear regression with robust standard errors. It is well known that, in a variety of settings, robust errors provide asymptotically correct inference for such average trends.
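A minimal sketch of that approach applied to your setting (variable names such as `methylation`, `gene_expression`, and `cell_quality` are placeholders):

library(sandwich)
library(lmtest)

## Per-gene model with heteroscedasticity-consistent (HC) standard errors
fit <- lm(methylation ~ gene_expression + cell_quality)
coeftest(fit, vcov. = vcovHC(fit, type = "HC0"))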

[three panels: sigmoidal true curves (black) with fitted OLS lines (blue); each panel's title reports the simulated power of the 0.05-level robust test]

library(sandwich)
library(lmtest)

set.seed(123)
par(mfrow = c(1, 3))
x <- seq(-3, 3, by = 0.1)

## Scenario 1: sigmoidal truth, plogis(x)
curve(plogis(x), from = -3, to = 3, xlab = 'X', ylab = 'invlogit(x) + N(0,1)')
o <- replicate(1000, {
  y <- plogis(x) + rnorm(length(x))
  coeftest(lm(y ~ x), vcov. = function(m) vcovHC(m, type = 'HC0'))[2, 4]
})
abline(lm(plogis(x) ~ x), col = 'blue')
mtext(side = 3, paste('Power for 0.05 test:',
                      formatC(mean(o < 0.05), format = 'f', digits = 2)))

## Scenario 2: shifted sigmoid, plogis(x + 2)
curve(plogis(x + 2), from = -3, to = 3)
o <- replicate(1000, {
  y <- plogis(x + 2) + rnorm(length(x))
  coeftest(lm(y ~ x), vcov. = function(m) vcovHC(m, type = 'HC0'))[2, 4]
})
abline(lm(plogis(x + 2) ~ x), col = 'blue')
mtext(side = 3, paste('Power for 0.05 test:',
                      formatC(mean(o < 0.05), format = 'f', digits = 2)))

## Scenario 3: rescaled sigmoid, plogis(x * 5)
curve(plogis(x * 5), from = -3, to = 3)
o <- replicate(1000, {
  y <- plogis(x * 5) + rnorm(length(x))
  coeftest(lm(y ~ x), vcov. = function(m) vcovHC(m, type = 'HC0'))[2, 4]
})
abline(lm(plogis(x * 5) ~ x), col = 'blue')
mtext(side = 3, paste('Power for 0.05 test:',
                      formatC(mean(o < 0.05), format = 'f', digits = 2)))

AdamO
  • I think the mention of Spearman is incidental; the question is more squarely on how to capture nonlinear relationships. – rolando2 Dec 12 '23 at 14:31
  • @rolando2 AdamO put it perfectly. I was interested to know how to capture monotonic relationships in a multivariate setting. – Sam Dec 12 '23 at 14:36
  • Have you read our threads about monotonic regression? – whuber Dec 12 '23 at 15:55
  • @AdamO could you add to the post an explanation of what do the plots demonstrate? – Sam Dec 14 '23 at 16:33
  • @Sam here I'm showing a "true" trend (black) versus the linear modeled trend via OLS (blue). I've chosen a sigmoidal "truth" under some shift/scale scenarios, but the errors are normally distributed per assumptions. In each scenario, you can see the blue trend correctly identifies a positive relationship (unit change in x associated with positive mean difference). The inference is well, or poorly, powered depending on some interesting features. Scenario 2 has an inflection point near the mean of "x" and the range of the expected Y is smaller compared to the error. – AdamO Dec 14 '23 at 18:21
  • Thanks. Is there a reason that you use specifically HC0? https://en.wikipedia.org/wiki/Heteroskedasticity-consistent_standard_errors says that "the HC3 specification appears to work best" – Sam Dec 18 '23 at 11:40
  • While I appreciate the spirit of this answer in terms of elucidating the purpose of the Spearman $\rho$, I agree with @rolando2 about the connection to the question, as it doesn't seem to address 1) the question about estimating two separate predictors and 2) the potentially nonlinear data present. – Shawn Hemelstrand Dec 18 '23 at 12:45
  • @Sam my preference to use HC0 is that it is the first and most intuitive heteroscedasticity-consistent estimator. Others have improved small sample performance, but here I am talking about an N of 63. You may play around with simulation code or other scenarios to decide which best fits your purpose. – AdamO Dec 18 '23 at 17:15
  • @ShawnHemelstrand 1) My answer states that the solution to the two-or-more factor problem is to use multilinear regression. The motivating example with one covariate illustrates sandwich error estimation. 2) I made a case for why, based on the problem description, OP in fact means monotonic rather than non-linear. Their comments agree with this point. I am borrowing from my own experience planning analyses for microarray experiments. – AdamO Dec 18 '23 at 17:18
  • Rereading this, your approach makes sense. I think your example would have been clearer to me had you provided multiple predictors in your simulation, but perhaps that isn't a major sticking point. – Shawn Hemelstrand Dec 18 '23 at 23:52
  • @AdamO what do you think of modeling the "cell quality" (described in the comment to the OP)? – Sam Dec 19 '23 at 12:53
  • @Sam I don't know your experiment well enough to advise. For a cancer study, you might consider using the cell quality as a weight if it might reflect precision in the measurement of outcome. – AdamO Dec 19 '23 at 15:01
  • Is robustbase::lmrob also an option? – Sam Feb 14 '24 at 18:20