I am working with genes and I am designing a model that its dependent variable must be diagnosis, whereas the rest of potential variables to include are sex, ethnicity(2 levels), rin(RNA integrity number), age and region(2 different body areas where sampling occured). Rin and age are continuous and I have transformed them to categorical (3 levels each):
| # | diag | sex | ethn | rin | age | region |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 6 | 1 | 1 | 1 | 2 | 3 | 1 |
| 9 | 0 | 0 | 1 | 2 | 1 | 0 |
| 10 | 0 | 1 | 1 | 2 | 3 | 0 |
| 11 | 1 | 0 | 0 | 3 | 2 | 0 |
.......etc
I have around 400 samples, 150 of them as controls, and traits region and sex are highly skewed.
I performed ANOVA testing for nested glm models to define which of the variables are the most important to include and these are ethnicity, sex and age:
**Analysis of Deviance Table**
Model 1: as.factor(diag) ~ as.factor(ethn) + as.factor(sex)
Model 2: as.factor(diag) ~ as.factor(ethn) + as.factor(sex) + as.factor(age)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 381 500.60
2 379 473.39 2 27.212 1.233e-06 ***
Analysis of Deviance Table
Model 1: as.factor(diag) ~ as.factor(ethn) + as.factor(sex) + as.factor(age)
Model 2: as.factor(diag) ~ as.factor(ethn) + as.factor(sex) + as.factor(age) + as.factor(rin)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 379 473.39
2 377 470.76 2 2.6258 0.269
Because after PCA I suspected batch effects, I performed surrogate value analysis(using https://bioconductor.org/packages/sva/) by trying different combinations of variables. But I notice that the most significant surrogate values arise when I include unimportant variables such as region/rin(sv3, p<0.000001), but not when I tried my original statistically significant variables(sv1, p<0.03):
**mod0** = model.matrix(~as.factor(ethn)+as.factor(sex)+as.factor(age), data=datTraits)
mod = model.matrix(~as.factor(diag)+as.factor(ethn)+as.factor(sex)+as.factor(age), data=datTraits)
Number of significant surrogate variables is: 1
**mod0** = model.matrix(~as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(rin), data=datTraits)
mod = model.matrix(~as.factor(diag)+as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(rin), data=datTraits)
Number of significant surrogate variables is: 5
**mod0** = model.matrix(~as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(region)+as.factor(rin), data=datTraits)
mod = model.matrix(~as.factor(diag)+as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(region)+as.factor(rin), data=datTraits)
Number of significant surrogate variables is: 1
**mod0** = model.matrix(~as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(region), data=datTraits)
mod = model.matrix(~as.factor(diag)+as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(region), data=datTraits)
Number of significant surrogate variables is: 5
My goal is to remove from my model the surrogate value that captures as much variance as possible and is not correlated to dependent variable "diag".
Could anyone please tell me what I am doing wrong???
Prida
PS: Below I have added the SV results when I have included the region variable and when not. Since in the first case the surrogate value is positively correlated to the dependent biological variable, shall I keep it?
mod0 = model.matrix(~as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(region), data=datTraits)
mod = model.matrix(~as.factor(diag)+as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(region), data=datTraits)
Response Y3 :
Call:
lm(formula = Y3 ~ datTraits$diag)
Residuals:
Min 1Q Median 3Q Max
-0.11038 -0.03425 -0.00597 0.03375 0.15983
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.009628 0.003157 -3.050 0.00245 **
datTraits$diag1 0.026792 0.005266 5.088 5.7e-07 ***
Signif. codes: 0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04951 on 382 degrees of freedom
Multiple R-squared: 0.06346, Adjusted R-squared: 0.06101
F-statistic: 25.88 on 1 and 382 DF, p-value: 5.695e-07
mod0 = model.matrix(~as.factor(ethn)+as.factor(sex)+as.factor(age), data=datTraits)
mod = model.matrix(~as.factor(diag)+as.factor(ethn)+as.factor(sex)+as.factor(age), data=datTraits)
Call:
lm(formula = svobj$sv ~ datTraits$diag)
Residuals:
Min 1Q Median 3Q Max
-0.09461 -0.06206 0.02571 0.04066 0.06141
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.004097 0.003243 1.263 0.2073
datTraits$diag1 -0.011401 0.005410 -2.107 0.0357 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.05087 on 382 degrees of freedom
Multiple R-squared: 0.01149, Adjusted R-squared: 0.008904
F-statistic: 4.441 on 1 and 382 DF, p-value: 0.03574
```
rinthe "RNA integrity number" characterizing the quality of an RNA prep, or something else, and what isregion? Third, how do you intend to combine your diagnosis-as-outcome model with SVA and gene-expression results? Finally, how many patients, how many genes, and how many diagnoses are involved, and how many patients have which diagnosis? – EdM May 01 '22 at 15:23