3

I want to find some good predictors (genes). I have checked the Spearman correlation of the expression of each of 23 genes with dependent variables (responders Vs non-responders and I saw only the 5 genes identified initially showed significant correlation with the dependent variable.

This is my data, log transformed RNA-seq:

              TRG    CDK6 EGFR  KIF2C CDC20
    Sample 1  TRG12  11.39 10.62  9.75 10.34
    Sample 2  TRG12  10.16  8.63  8.68  9.08
    Sample 3  TRG12   9.29 10.24  9.89 10.11
    Sample 4  TRG45  11.53  9.22  9.35  9.13
    Sample 5  TRG45   8.35 10.62 10.25 10.01
    Sample 6  TRG45  11.71 10.43  8.87  9.44
    ...

23 differentially expressed genes between responder patients (TRG12) and non-responder patients (TRG45).

1- I tested each of 23 genes individually in this code and each of them gives p-value < 0.05 remained as a good predictor; For example for CDK6 I have done

> glm=glm(TRG ~ CDK6, data = df, family = binomial(link = 'logit'))
> 
> summary(glm)

Call: glm(formula = TRG ~ CDK6, family = binomial(link = "logit"), data = df)

Deviance Residuals: Min 1Q Median 3Q Max
-1.8286 -0.9317 0.6382 0.9500 1.5966

Coefficients: Estimate Std. Error z value Pr(>|z|)
(Intercept) 26.8950 9.6627 2.783 0.00538 ** CDK6 -2.6932 0.9765 -2.758 0.00582 **


Signif. codes: 0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 77.347  on 55  degrees of freedom

Residual deviance: 64.062 on 54 degrees of freedom AIC: 68.062

Number of Fisher Scoring iterations: 5

Finally I obtained five genes and I put them in this model:

final <- glm(TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2 , data = df, family = binomial(link = 'logit'))

but

> summary(final)

Call: glm(formula = TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2, family = binomial(link = "logit"), data = df)

Deviance Residuals: Min 1Q Median 3Q Max
-1.5696 -0.9090 0.2233 0.7125 2.0131

Coefficients: Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.2258 13.3877 0.241 0.8096
CDK6 -2.6101 1.0860 -2.403 0.0162 * CXCL8 0.8702 0.9783 0.889 0.3737
IL6 0.9591 0.8883 1.080 0.2803
ISG15 0.9574 0.5101 1.877 0.0605 . PTGS2 -0.4623 1.1005 -0.420 0.6744


Signif. codes: 0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 77.347  on 55  degrees of freedom

Residual deviance: 53.917 on 50 degrees of freedom AIC: 65.917

Number of Fisher Scoring iterations: 6

I am just seeing CDK6 as significant. I done cross validation

> cv.glm(df, final, K=nrow(df))$delta
[1] 0.2130050 0.2125748

Likely the delta values are not greatly differ (Model is right ???)

Then I tried Sensitivity and Specificity

> threshold=0.5
> predicted_values<-ifelse(predict(final,type="response")>threshold,1,0)
> actual_values<-final$y
> conf_matrix<-table(predicted_values,actual_values)
> conf_matrix
                actual_values
predicted_values  0  1
               0 16  7
               1 10 23
> sensitivity(conf_matrix)
[1] 0.6153846
> specificity(conf_matrix)
[1] 0.7666667

2- I went this way for step-wise method

full.model <- glm(TRG ~., data = df, family = binomial)

library(MASS) step.model <- full.model %>% stepAIC(trace = FALSE) coef(step.model)

> coef(step.model) (Intercept) CDK6 CHGA CSF3 CXCL8 ERBB2 IL6 ISG15 KRT14 KRT17 -23770.6261 -469.3660 376.3528 -238.5697 1404.5905 590.2586 789.5295 842.2058 423.0402 -651.3403 MAGEA1 OSM PTGS2 496.6933 -637.0583 -500.1040

In addition to 5 genes another genes remains as predictors but likely none of them are significant by Pr(>|z|) > 0.05

> summary(step.model)

Call: glm(formula = TRG ~ CDK6 + CHGA + CSF3 + CXCL8 + ERBB2 + IL6 + ISG15 + KRT14 + KRT17 + MAGEA1 + OSM + PTGS2, family = binomial, data = df)

Deviance Residuals: Min 1Q Median 3Q Max
-1.317e-04 -2.100e-08 2.100e-08 2.100e-08 1.396e-04

Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -23770.6 2922188.9 -0.008 0.994 CDK6 -469.4 64920.6 -0.007 0.994 CHGA 376.4 46191.2 0.008 0.993 CSF3 -238.6 30857.1 -0.008 0.994 CXCL8 1404.6 172558.7 0.008 0.994 ERBB2 590.3 71989.5 0.008 0.993 IL6 789.5 100130.7 0.008 0.994 ISG15 842.2 102934.0 0.008 0.993 KRT14 423.0 52565.5 0.008 0.994 KRT17 -651.3 79489.6 -0.008 0.993 MAGEA1 496.7 60319.6 0.008 0.993 OSM -637.1 82121.8 -0.008 0.994 PTGS2 -500.1 73541.7 -0.007 0.995

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 7.7347e+01  on 55  degrees of freedom

Residual deviance: 1.4741e-07 on 43 degrees of freedom AIC: 26

Number of Fisher Scoring iterations: 25

3- I went to lasso regression

> library(glmnet)
> lassoModel <- glmnet(
+   x=data.matrix(df[,-1]),
+   y=df$TRG,
+   standardize=TRUE,
+   alpha=1.0,
+   family="multinomial")
> #Derive coefficients for each gene to each subtype
> co <- coef(lassoModel, s=idealLambda, exact=TRUE)
> co
$`TRG12`
24 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept)  4.441322326
ALB         -0.010867331
AQP9        -0.051908908
CALML5       0.126359364
CCL4        -0.059869284
CDK6         0.283781437
CHGA        -0.088283426
CREB3L3      0.109988758
CSF3        -0.012311176
CXCL5       -0.030532485
CXCL6       -0.063610112
CXCL8       -0.084385732
ERBB2       -0.102303095
FGFRL1       0.148759702
IL1B        -0.072225606
IL6         -0.122671026
ISG15       -0.297386587
KLK5         0.039891341
KRT14       -0.004536929
KRT17        0.041693665
MAGEA1      -0.154075458
OSM          0.028987648
PTGS2       -0.034960405
S100A7A     -0.008190105

$TRG45 24 x 1 sparse Matrix of class "dgCMatrix" 1 (Intercept) -4.441322326 ALB 0.010867331 AQP9 0.051908908 CALML5 -0.126359364 CCL4 0.059869284 CDK6 -0.283781437 CHGA 0.088283426 CREB3L3 -0.109988758 CSF3 0.012311176 CXCL5 0.030532485 CXCL6 0.063610112 CXCL8 0.084385732 ERBB2 0.102303095 FGFRL1 -0.148759702 IL1B 0.072225606 IL6 0.122671026 ISG15 0.297386587 KLK5 -0.039891341 KRT14 0.004536929 KRT17 -0.041693665 MAGEA1 0.154075458 OSM -0.028987648 PTGS2 0.034960405 S100A7A 0.008190105

>

Likely none of genes is significant so I just tried 5 genes from first glm

> finalLasso <- glm(df$TRG ~ CDK6 + IL6 + ISG15 + CXCL8 + CALML5 , data=df, family=binomial(link="logit"))
> summary(finalLasso)

Call: glm(formula = df$TRG ~ CDK6 + IL6 + ISG15 + CXCL8 + CALML5, family = binomial(link = "logit"), data = df)

Deviance Residuals: Min 1Q Median 3Q Max
-1.6780 -0.6733 0.1643 0.6925 1.9182

Coefficients: Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4178 14.7456 -0.028 0.9774
CDK6 -2.2672 1.1537 -1.965 0.0494 * IL6 1.2524 0.9197 1.362 0.1733
ISG15 1.1763 0.5479 2.147 0.0318 * CXCL8 0.5413 0.6443 0.840 0.4008
CALML5 -0.7890 0.4219 -1.870 0.0615 .


Signif. codes: 0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 77.347  on 55  degrees of freedom

Residual deviance: 49.839 on 50 degrees of freedom AIC: 61.839

Number of Fisher Scoring iterations: 6

>

That 2 genes are significant; really confusing :(

Please somebody save me from this horrible confusion, finally which genes are significant remaining in model and which not

First confusion: Why when I am putting each of 23 genes individually in glm model, 5 of them returns p-value < 0.05 but when summary of final model containing these 5 genes says only p-value of one of them is < 0.05

Second confusion: When I am using stepwise regression, summary of model says none of my genes is significant while some of them had been remained as predictors by coef(step.model)

Third confusion: CDK6 as my significant genes in two of model shows negative coefficient while I know that show be correlated with dependent variable positively

gringer
  • 14,012
  • 5
  • 23
  • 79
Zizogolu
  • 2,148
  • 11
  • 44

1 Answers1

0

confusion: Why when I am putting each of 23 genes individually in glm model, 5 of them returns p-value < 0.05 but when summary of final model containing these 5 genes says only p-value of one of them is < 0.05.

Ans. Pr(>|z|) simply suggests that whether z-value deviates beyond average z-value and if so,what is the extent of its probability. less than .05 indicates a 95% confidence in computed z value i.e. an acceptable significance level. The summary of 5 genes simply tells significance of over-all model.

Subhash C. Davar
  • 202
  • 1
  • 10