I have five biomarkers ($X_1, \ldots, X_5$) and I want to model an outcome ($Y$) using them. The biomarkers are correlated, so I decided to use ridge regression to stabilize the coefficients. The final goal is to build a composite of these biomarkers (a biomarker load score for patients) using the ridge-derived weights and show its correlation with the outcome. To make my question easy to understand, I simulate an example below.
library(MASS)
# Symmetric 6 x 6 correlation matrix: variable 1 is the outcome Y,
# variables 2-6 are the biomarkers X1-X5
M <- matrix(0, 6, 6)
diag(M) <- 1
M[1,2] <- -0.2; M[2,1] <- -0.2
M[1,3] <- -0.5; M[3,1] <- -0.5
M[1,4] <- -0.3; M[4,1] <- -0.3
M[1,5] <- 0.6;  M[5,1] <- 0.6
M[1,6] <- -0.2; M[6,1] <- -0.2
M[3,4] <- 0.5; M[3,5] <- 0.3; M[3,6] <- 0.2
M[4,3] <- 0.5; M[4,5] <- 0.3; M[4,6] <- 0.2
M[5,3] <- 0.3; M[5,4] <- 0.3; M[5,6] <- 0.2
M[6,3] <- 0.2; M[6,4] <- 0.2; M[6,5] <- 0.2
# 200 draws whose sample correlation matrix equals M exactly
data <- mvrnorm(200, rep(0, 6), M, empirical = TRUE)
library(glmnet)
X <- data[, -1]   # the five biomarkers
Y <- data[, 1]    # the outcome
set.seed(123)
# Ridge regression (alpha = 0); variables are already on the same scale, hence standardize = FALSE
cvlr <- glmnet::cv.glmnet(x = X, y = Y, family = "gaussian", alpha = 0, standardize = FALSE)
lr <- cvlr$glmnet.fit
> coef(cvlr, s = "lambda.min")
6 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 3.284406e-17
V1 -1.886524e-01
V2 -5.680591e-01
V3 -2.109487e-01
V4 8.235752e-01
V5 -1.970602e-01
coeff_lr <- coef(cvlr, s = "lambda.min")[,1]
# Weighted sum of the five biomarkers; the intercept (first element) is excluded
composite <- as.numeric(X[,1:5]%*%coeff_lr[2:6])
> cor(Y, composite)
[1] 0.9956233
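As a side note, the 0.9956 is no accident: with `empirical = TRUE` the sample correlations equal `M` exactly, so the largest correlation any linear composite of the biomarkers can achieve is the multiple correlation $\sqrt{r^\top S^{-1} r}$, where $r$ holds the outcome–biomarker correlations and $S$ the correlations among the biomarkers. A standalone check (with `M` rebuilt compactly so the snippet runs on its own):

```r
# With empirical = TRUE the sample correlation matrix equals M exactly,
# so sqrt(r' S^-1 r) is the ceiling for cor(Y, any linear composite).
M <- diag(6)
M[1, 2:6] <- c(-0.2, -0.5, -0.3, 0.6, -0.2)   # cor(Y, X1..X5)
M[3, 4:6] <- c(0.5, 0.3, 0.2)
M[4, 5:6] <- c(0.3, 0.2)
M[5, 6]   <- 0.2
M[lower.tri(M)] <- t(M)[lower.tri(M)]          # symmetrize

r <- M[1, 2:6]       # outcome-biomarker correlations
S <- M[2:6, 2:6]     # correlations among the biomarkers
sqrt(drop(t(r) %*% solve(S, r)))   # ~0.9956: the ridge composite sits near this bound
```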
As you can see, all biomarkers except one ($X_4$) are negatively correlated with the outcome, yet the composite built this way shows a strong positive association. This happens because multiplying a biomarker by a negative coefficient flips the sign of its association: that part of the composite (e.g. $\beta_2X_2$) becomes positively associated with $Y$. One option is to simply reverse the sign of the composite by multiplying it by $-1$. However, that would turn the part of the composite that should be positively related to $Y$ (such as $\beta_4X_4$) into a negatively related one, which is equivalent to multiplying $X_4$ by $-1$ before running the ridge regression.
composite.rev.sign <- -composite
> cor(Y, composite.rev.sign)
[1] -0.9956233
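The sign-flip mechanics can be seen in isolation with toy data (separate from the simulation above): scaling a predictor by a negative weight exactly negates its correlation with the outcome, so a negative $\beta_j$ turns a negatively related biomarker into a positively contributing term.

```r
set.seed(1)
x <- rnorm(100)
y <- -x + rnorm(100)    # x is negatively related to y
b <- -0.5               # a negative (ridge-like) weight

cor(y, x)               # negative
cor(y, b * x)           # positive: multiplying by b < 0 flips the sign

# The flip is exact: cor(y, b * x) = sign(b) * cor(y, x)
stopifnot(isTRUE(all.equal(cor(y, b * x), -cor(y, x))))
```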
The other option is to build the composite using the absolute values of the coefficients. In that case, the parts that should be positively/negatively related to $Y$ keep their original direction of association. But I'm afraid important positive and negative effects could then cancel each other out (because of the plus and minus operations), leaving the composite with no apparent association with $Y$.
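That cancellation worry is easy to demonstrate with toy data (again separate from the simulation): if the true weights are $+1$ and $-1$, the absolute-value composite loses the association entirely.

```r
set.seed(42)
n  <- 10000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- x1 - x2             # true weights are +1 and -1

comp_signed <- x1 - x2    # signed weights: identical to y
comp_abs    <- x1 + x2    # absolute weights: the two effects cancel

cor(y, comp_signed)       # 1
cor(y, comp_abs)          # near 0
```

Whether this happens in a real composite depends on the mix of signs and magnitudes of the weighted terms.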
composite.abs <- as.numeric(X[,2:5]%*%abs(coeff_lr[3:6]))
> cor(Y, composite.abs)
[1] 0.08175834
Note that the correlation of the composite with $Y$ is positive in the above output, which could be an artifact of the data I made up, or due to the large coefficient of the positive biomarker outweighing the negative effects. I'm not sure whether that is desirable for a composite meant to serve as a patient's biomarker load (in prediction, cancellation of effects is fine). We could just as well have had 3 negative and 2 positive coefficients.
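To see which effects drive, or cancel within, a given composite, the covariance decomposes term by term: $\operatorname{cov}(Y, \sum_j w_j X_j) = \sum_j w_j \operatorname{cov}(Y, X_j)$. A standalone sketch with made-up weights (names and numbers here are illustrative only, not from the simulation above):

```r
set.seed(7)
n  <- 500
Xt <- matrix(rnorm(n * 3), n, 3)    # three toy biomarkers
Yt <- -Xt[, 1] + Xt[, 2] + rnorm(n)
w  <- c(-0.8, 0.9, 0.1)             # hypothetical ridge-style weights

# Per-biomarker contribution to cov(Y, composite): the sign of w_j times
# the sign of cov(Y, X_j) shows whether a term adds to or cancels the association.
contrib <- w * apply(Xt, 2, function(col) cov(Yt, col))
total   <- cov(Yt, as.numeric(Xt %*% w))
stopifnot(isTRUE(all.equal(sum(contrib), total)))
round(contrib, 3)
```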
Question:
What would be your suggestion for signs of beta weights in such a composite?
Edit:
I may need to explain a bit more. In addition to the biomarkers, I have some covariates. I want to stabilize the beta coefficients with ridge because the biomarkers are correlated, and I also include the covariates (age, sex, GCS, etc.) in the model to obtain adjusted betas. The goal is to produce a weighted biomarker load score for the patients, which is why I thought of using only the betas and $X$'s corresponding to the biomarkers, leaving out the terms for the intercept and the covariates, so that the result is essentially a 'biomarker load score'.
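For reference, here is a minimal sketch of that workflow with simulated stand-in data (`age` and `sex` are placeholder covariates; `penalty.factor = 0` is `glmnet`'s way of leaving the covariates unpenalized so they act purely as adjusters while the biomarker betas are ridge-stabilized):

```r
library(glmnet)

set.seed(1)
n   <- 300
bm  <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("bm", 1:5)))
age <- rnorm(n, 60, 10)
sex <- rbinom(n, 1, 0.5)
Yc  <- as.numeric(bm %*% c(-0.3, -0.5, -0.2, 0.6, -0.2)) + 0.02 * age + rnorm(n)

Xfull <- cbind(bm, age = age, sex = sex)

# Ridge over biomarkers + covariates; penalty.factor = 0 leaves age/sex
# unpenalized, so the biomarker betas are adjusted for them.
cvfit <- cv.glmnet(Xfull, Yc, alpha = 0, penalty.factor = c(rep(1, 5), 0, 0))

# Biomarker load score: only the biomarker betas -- no intercept, no covariates.
bhat    <- coef(cvfit, s = "lambda.min")
bm_beta <- bhat[paste0("bm", 1:5), 1]
score   <- as.numeric(bm %*% bm_beta)
cor(Yc, score)
```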