I have five biomarkers ($X_1, \ldots, X_5$) and I want to model an outcome ($Y$) using them. The biomarkers are correlated, so I decided to use ridge regression to stabilize the coefficients. The final goal is to build a composite of these biomarkers (a biomarker load score for patients) using the ridge-derived weights and show its correlation with the outcome. To make my question easy to understand, I simulate an example below.
library(MASS)
# Symmetric 6 x 6 correlation matrix: variable 1 is the outcome Y,
# variables 2-6 are the biomarkers X1-X5
M <- matrix(0, 6, 6)
diag(M) <- 1
M[1,2] <- -0.2; M[2,1] <- -0.2
M[1,3] <- -0.5; M[3,1] <- -0.5
M[1,4] <- -0.3; M[4,1] <- -0.3
M[1,5] <- 0.6;  M[5,1] <- 0.6
M[1,6] <- -0.2; M[6,1] <- -0.2
M[3,4] <- 0.5; M[3,5] <- 0.3; M[3,6] <- 0.2
M[4,3] <- 0.5; M[4,5] <- 0.3; M[4,6] <- 0.2
M[5,3] <- 0.3; M[5,4] <- 0.3; M[5,6] <- 0.2
M[6,3] <- 0.2; M[6,4] <- 0.2; M[6,5] <- 0.2
# 200 draws whose sample correlation matrix equals M exactly
data <- mvrnorm(200, rep(0, 6), M, empirical = TRUE)
library(glmnet)
X <- data[, -1]   # the five biomarkers
Y <- data[, 1]    # the outcome
set.seed(123)
# Ridge regression (alpha = 0); variables are already on the same scale, hence standardize = FALSE
cvlr <- glmnet::cv.glmnet(x = X, y = Y, family = "gaussian", alpha = 0, standardize = FALSE)
lr <- cvlr$glmnet.fit
> coef(cvlr, s = "lambda.min")
6 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 3.284406e-17
V1 -1.886524e-01
V2 -5.680591e-01
V3 -2.109487e-01
V4 8.235752e-01
V5 -1.970602e-01
coeff_lr <- coef(cvlr, s = "lambda.min")[,1]
# Weighted sum of the five biomarkers; the intercept (first element) is excluded
composite <- as.numeric(X[,1:5]%*%coeff_lr[2:6])
> cor(Y, composite)
[1] 0.9956233
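As a side note, the 0.9956 is no accident: with `empirical = TRUE` the sample correlations equal `M` exactly, so the largest correlation any linear composite of the biomarkers can achieve is the multiple correlation $\sqrt{r^\top S^{-1} r}$, where $r$ holds the outcome–biomarker correlations and $S$ the correlations among the biomarkers. A standalone check (with `M` rebuilt compactly so the snippet runs on its own):

```r
# With empirical = TRUE the sample correlation matrix equals M exactly,
# so sqrt(r' S^-1 r) is the ceiling for cor(Y, any linear composite).
M <- diag(6)
M[1, 2:6] <- c(-0.2, -0.5, -0.3, 0.6, -0.2)   # cor(Y, X1..X5)
M[3, 4:6] <- c(0.5, 0.3, 0.2)
M[4, 5:6] <- c(0.3, 0.2)
M[5, 6]   <- 0.2
M[lower.tri(M)] <- t(M)[lower.tri(M)]          # symmetrize

r <- M[1, 2:6]       # outcome-biomarker correlations
S <- M[2:6, 2:6]     # correlations among the biomarkers
sqrt(drop(t(r) %*% solve(S, r)))   # ~0.9956: the ridge composite sits near this bound
```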
As you can see, all biomarkers except one ($X_4$) are negatively correlated with the outcome, yet the composite built this way shows a strong positive association. This happens because multiplying a biomarker by a negative coefficient flips the sign of its association: that part of the composite (e.g. $\beta_2X_2$) becomes positively associated with $Y$. One option is to simply reverse the sign of the composite by multiplying it by $-1$. However, that would turn the part of the composite that should be positively related to $Y$ (such as $\beta_4X_4$) into a negatively related one, which is equivalent to multiplying $X_4$ by $-1$ before running the ridge regression.
composite.rev.sign <- -composite
> cor(Y, composite.rev.sign)
[1] -0.9956233
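The sign-flip mechanics can be seen in isolation with toy data (separate from the simulation above): scaling a predictor by a negative weight exactly negates its correlation with the outcome, so a negative $\beta_j$ turns a negatively related biomarker into a positively contributing term.

```r
set.seed(1)
x <- rnorm(100)
y <- -x + rnorm(100)    # x is negatively related to y
b <- -0.5               # a negative (ridge-like) weight

cor(y, x)               # negative
cor(y, b * x)           # positive: multiplying by b < 0 flips the sign

# The flip is exact: cor(y, b * x) = sign(b) * cor(y, x)
stopifnot(isTRUE(all.equal(cor(y, b * x), -cor(y, x))))
```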
The other option is to build the composite using the absolute values of the coefficients. In that case, the parts that should be positively/negatively related to $Y$ keep their original direction of association. But I'm afraid important positive and negative effects could then cancel each other out (because of the plus and minus operations), leaving the composite with no apparent association with $Y$.
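That cancellation worry is easy to demonstrate with toy data (again separate from the simulation): if the true weights are $+1$ and $-1$, the absolute-value composite loses the association entirely.

```r
set.seed(42)
n  <- 10000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- x1 - x2             # true weights are +1 and -1

comp_signed <- x1 - x2    # signed weights: identical to y
comp_abs    <- x1 + x2    # absolute weights: the two effects cancel

cor(y, comp_signed)       # 1
cor(y, comp_abs)          # near 0
```

Whether this happens in a real composite depends on the mix of signs and magnitudes of the weighted terms.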
composite.abs <- as.numeric(X[,2:5]%*%abs(coeff_lr[3:6]))
> cor(Y, composite.abs)
[1] 0.08175834
Note that the correlation of the composite with $Y$ is positive in the above output, which could be an artifact of the data I made up, or due to the large coefficient of the positive biomarker outweighing the negative effects. I'm not sure whether that is desirable for a composite meant to serve as a patient's biomarker load (in prediction, cancellation of effects is fine). We could just as well have had 3 negative and 2 positive coefficients.
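To see which effects drive, or cancel within, a given composite, the covariance decomposes term by term: $\operatorname{cov}(Y, \sum_j w_j X_j) = \sum_j w_j \operatorname{cov}(Y, X_j)$. A standalone sketch with made-up weights (names and numbers here are illustrative only, not from the simulation above):

```r
set.seed(7)
n  <- 500
Xt <- matrix(rnorm(n * 3), n, 3)    # three toy biomarkers
Yt <- -Xt[, 1] + Xt[, 2] + rnorm(n)
w  <- c(-0.8, 0.9, 0.1)             # hypothetical ridge-style weights

# Per-biomarker contribution to cov(Y, composite): the sign of w_j times
# the sign of cov(Y, X_j) shows whether a term adds to or cancels the association.
contrib <- w * apply(Xt, 2, function(col) cov(Yt, col))
total   <- cov(Yt, as.numeric(Xt %*% w))
stopifnot(isTRUE(all.equal(sum(contrib), total)))
round(contrib, 3)
```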
Question:
What would be your suggestion for signs of beta weights in such a composite?
Edit:
I may need to explain a bit more. In addition to the biomarkers, I have some covariates. I want to stabilize the beta coefficients with ridge because the biomarkers are correlated, and I also include the covariates (age, sex, GCS, etc.) in the model to obtain adjusted betas. The goal is to produce a weighted biomarker load score for the patients, which is why I thought of using only the betas and $X$'s corresponding to the biomarkers, leaving out the terms for the intercept and the covariates, so that the result is essentially a 'biomarker load score'.
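For reference, here is a minimal sketch of that workflow with simulated stand-in data (`age` and `sex` are placeholder covariates; `penalty.factor = 0` is `glmnet`'s way of leaving the covariates unpenalized so they act purely as adjusters while the biomarker betas are ridge-stabilized):

```r
library(glmnet)

set.seed(1)
n   <- 300
bm  <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("bm", 1:5)))
age <- rnorm(n, 60, 10)
sex <- rbinom(n, 1, 0.5)
Yc  <- as.numeric(bm %*% c(-0.3, -0.5, -0.2, 0.6, -0.2)) + 0.02 * age + rnorm(n)

Xfull <- cbind(bm, age = age, sex = sex)

# Ridge over biomarkers + covariates; penalty.factor = 0 leaves age/sex
# unpenalized, so the biomarker betas are adjusted for them.
cvfit <- cv.glmnet(Xfull, Yc, alpha = 0, penalty.factor = c(rep(1, 5), 0, 0))

# Biomarker load score: only the biomarker betas -- no intercept, no covariates.
bhat    <- coef(cvfit, s = "lambda.min")
bm_beta <- bhat[paste0("bm", 1:5), 1]
score   <- as.numeric(bm %*% bm_beta)
cor(Yc, score)
```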