I want to find some good predictors (genes). I have checked the Spearman correlation of the expression of each of 23 genes with dependent variables (responders Vs non-responders and I saw only the 5 genes identified initially showed significant correlation with the dependent variable.
This is my data, log transformed RNA-seq:
TRG CDK6 EGFR KIF2C CDC20
Sample 1 TRG12 11.39 10.62 9.75 10.34
Sample 2 TRG12 10.16 8.63 8.68 9.08
Sample 3 TRG12 9.29 10.24 9.89 10.11
Sample 4 TRG45 11.53 9.22 9.35 9.13
Sample 5 TRG45 8.35 10.62 10.25 10.01
Sample 6 TRG45 11.71 10.43 8.87 9.44
...
23 differentially expressed genes between responder patients (TRG12) and non-responder patients (TRG45).
1- I tested each of 23 genes individually in this code and each of them gives p-value < 0.05 remained as a good predictor; For example for CDK6 I have done
> glm=glm(TRG ~ CDK6, data = df, family = binomial(link = 'logit'))
>
> summary(glm)
Call:
glm(formula = TRG ~ CDK6, family = binomial(link = "logit"),
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8286 -0.9317 0.6382 0.9500 1.5966
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 26.8950 9.6627 2.783 0.00538 **
CDK6 -2.6932 0.9765 -2.758 0.00582 **
Signif. codes: 0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 77.347 on 55 degrees of freedom
Residual deviance: 64.062 on 54 degrees of freedom
AIC: 68.062
Number of Fisher Scoring iterations: 5
Finally I obtained five genes and I put them in this model:
final <- glm(TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2 , data = df, family = binomial(link = 'logit'))
but
> summary(final)
Call:
glm(formula = TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2, family = binomial(link = "logit"),
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5696 -0.9090 0.2233 0.7125 2.0131
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.2258 13.3877 0.241 0.8096
CDK6 -2.6101 1.0860 -2.403 0.0162 *
CXCL8 0.8702 0.9783 0.889 0.3737
IL6 0.9591 0.8883 1.080 0.2803
ISG15 0.9574 0.5101 1.877 0.0605 .
PTGS2 -0.4623 1.1005 -0.420 0.6744
Signif. codes: 0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 77.347 on 55 degrees of freedom
Residual deviance: 53.917 on 50 degrees of freedom
AIC: 65.917
Number of Fisher Scoring iterations: 6
I am just seeing CDK6 as significant. I done cross validation
> cv.glm(df, final, K=nrow(df))$delta
[1] 0.2130050 0.2125748
Likely the delta values are not greatly differ (Model is right ???)
Then I tried Sensitivity and Specificity
> threshold=0.5
> predicted_values<-ifelse(predict(final,type="response")>threshold,1,0)
> actual_values<-final$y
> conf_matrix<-table(predicted_values,actual_values)
> conf_matrix
actual_values
predicted_values 0 1
0 16 7
1 10 23
> sensitivity(conf_matrix)
[1] 0.6153846
> specificity(conf_matrix)
[1] 0.7666667
2- I went this way for step-wise method
full.model <- glm(TRG ~., data = df, family = binomial)
library(MASS)
step.model <- full.model %>% stepAIC(trace = FALSE)
coef(step.model)
> coef(step.model)
(Intercept) CDK6 CHGA CSF3 CXCL8 ERBB2 IL6 ISG15 KRT14 KRT17
-23770.6261 -469.3660 376.3528 -238.5697 1404.5905 590.2586 789.5295 842.2058 423.0402 -651.3403
MAGEA1 OSM PTGS2
496.6933 -637.0583 -500.1040
In addition to 5 genes another genes remains as predictors but likely none of them are significant by Pr(>|z|) > 0.05
> summary(step.model)
Call:
glm(formula = TRG ~ CDK6 + CHGA + CSF3 + CXCL8 + ERBB2 + IL6 +
ISG15 + KRT14 + KRT17 + MAGEA1 + OSM + PTGS2, family = binomial,
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.317e-04 -2.100e-08 2.100e-08 2.100e-08 1.396e-04
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -23770.6 2922188.9 -0.008 0.994
CDK6 -469.4 64920.6 -0.007 0.994
CHGA 376.4 46191.2 0.008 0.993
CSF3 -238.6 30857.1 -0.008 0.994
CXCL8 1404.6 172558.7 0.008 0.994
ERBB2 590.3 71989.5 0.008 0.993
IL6 789.5 100130.7 0.008 0.994
ISG15 842.2 102934.0 0.008 0.993
KRT14 423.0 52565.5 0.008 0.994
KRT17 -651.3 79489.6 -0.008 0.993
MAGEA1 496.7 60319.6 0.008 0.993
OSM -637.1 82121.8 -0.008 0.994
PTGS2 -500.1 73541.7 -0.007 0.995
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7.7347e+01 on 55 degrees of freedom
Residual deviance: 1.4741e-07 on 43 degrees of freedom
AIC: 26
Number of Fisher Scoring iterations: 25
3- I went to lasso regression
> library(glmnet)
> lassoModel <- glmnet(
+ x=data.matrix(df[,-1]),
+ y=df$TRG,
+ standardize=TRUE,
+ alpha=1.0,
+ family="multinomial")
> #Derive coefficients for each gene to each subtype
> co <- coef(lassoModel, s=idealLambda, exact=TRUE)
> co
$`TRG12`
24 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 4.441322326
ALB -0.010867331
AQP9 -0.051908908
CALML5 0.126359364
CCL4 -0.059869284
CDK6 0.283781437
CHGA -0.088283426
CREB3L3 0.109988758
CSF3 -0.012311176
CXCL5 -0.030532485
CXCL6 -0.063610112
CXCL8 -0.084385732
ERBB2 -0.102303095
FGFRL1 0.148759702
IL1B -0.072225606
IL6 -0.122671026
ISG15 -0.297386587
KLK5 0.039891341
KRT14 -0.004536929
KRT17 0.041693665
MAGEA1 -0.154075458
OSM 0.028987648
PTGS2 -0.034960405
S100A7A -0.008190105
$TRG45
24 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -4.441322326
ALB 0.010867331
AQP9 0.051908908
CALML5 -0.126359364
CCL4 0.059869284
CDK6 -0.283781437
CHGA 0.088283426
CREB3L3 -0.109988758
CSF3 0.012311176
CXCL5 0.030532485
CXCL6 0.063610112
CXCL8 0.084385732
ERBB2 0.102303095
FGFRL1 -0.148759702
IL1B 0.072225606
IL6 0.122671026
ISG15 0.297386587
KLK5 -0.039891341
KRT14 0.004536929
KRT17 -0.041693665
MAGEA1 0.154075458
OSM -0.028987648
PTGS2 0.034960405
S100A7A 0.008190105
>
Likely none of genes is significant so I just tried 5 genes from first glm
> finalLasso <- glm(df$TRG ~ CDK6 + IL6 + ISG15 + CXCL8 + CALML5 , data=df, family=binomial(link="logit"))
> summary(finalLasso)
Call:
glm(formula = df$TRG ~ CDK6 + IL6 + ISG15 + CXCL8 + CALML5, family = binomial(link = "logit"),
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6780 -0.6733 0.1643 0.6925 1.9182
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4178 14.7456 -0.028 0.9774
CDK6 -2.2672 1.1537 -1.965 0.0494 *
IL6 1.2524 0.9197 1.362 0.1733
ISG15 1.1763 0.5479 2.147 0.0318 *
CXCL8 0.5413 0.6443 0.840 0.4008
CALML5 -0.7890 0.4219 -1.870 0.0615 .
Signif. codes: 0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 77.347 on 55 degrees of freedom
Residual deviance: 49.839 on 50 degrees of freedom
AIC: 61.839
Number of Fisher Scoring iterations: 6
>
That 2 genes are significant; really confusing :(
Please somebody save me from this horrible confusion, finally which genes are significant remaining in model and which not
First confusion: Why when I am putting each of 23 genes individually in glm model, 5 of them returns p-value < 0.05 but when summary of final model containing these 5 genes says only p-value of one of them is < 0.05
Second confusion: When I am using stepwise regression, summary of model says none of my genes is significant while some of them had been remained as predictors by
coef(step.model)Third confusion: CDK6 as my significant genes in two of model shows negative coefficient while I know that show be correlated with dependent variable positively