
Fundamental question:

I have used one dataset to build a model for survival prediction.

The model uses about 40 genes as predictors, which I evaluated on both the training and test splits of the TCGA cohort. To test it on independent data, I have chosen a larger patient cohort, the beatAML dataset, as my validation dataset. The issue I'm facing is that some of the genes for which I have coefficients are missing from the validation dataset.

So the simple question is: is it conceptually correct to eliminate the genes that are not present in my validation cohort and go ahead with the ones that are present?

PesKchan

1 Answer


So the simple question is: is it conceptually correct to eliminate the genes that are not present in my validation cohort and go ahead with the ones that are present?

You might start by seeing how well your original model without those genes works on The Cancer Genome Atlas (TCGA) data. I suspect that it won't be as good, but that would be one thing to try. If you think that the model without those genes is good enough, then proceed as you suggest.
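As a rough illustration of that check, the sketch below scores patients with the full and the reduced coefficient sets and compares a simplified Harrell's concordance index. The gene names, coefficients, expression values, and survival data are all made up for the example, and pairs with tied times are simply skipped, so treat it as a sketch rather than a replacement for a proper survival package.

```python
from itertools import combinations

def risk_scores(expr_rows, coefs):
    """Linear predictor per patient: sum of coefficient * expression."""
    return [sum(beta * row[gene] for gene, beta in coefs.items()) for row in expr_rows]

def concordance_index(times, events, scores):
    """Simplified Harrell's C: share of comparable pairs in which the
    patient with the shorter survival time has the higher risk score."""
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        # a is the patient with the shorter follow-up time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue  # tied times or earlier patient censored: not comparable
        comparable += 1
        if scores[a] > scores[b]:
            concordant += 1.0
        elif scores[a] == scores[b]:
            concordant += 0.5
    return concordant / comparable

# Hypothetical coefficients from the original model (names are placeholders)
coefs = {"GENE_A": 0.8, "GENE_B": -0.5, "GENE_C": 0.3}
missing_in_validation = {"GENE_C"}
reduced_coefs = {g: b for g, b in coefs.items() if g not in missing_in_validation}

# Toy stand-in for TCGA expression and survival data
expr_rows = [
    {"GENE_A": 2.1, "GENE_B": 0.4, "GENE_C": 1.0},
    {"GENE_A": 1.5, "GENE_B": 1.2, "GENE_C": 0.2},
    {"GENE_A": 0.3, "GENE_B": 2.0, "GENE_C": 1.8},
    {"GENE_A": 1.0, "GENE_B": 0.9, "GENE_C": 0.6},
    {"GENE_A": 0.2, "GENE_B": 1.1, "GENE_C": 0.1},
]
times = [5.0, 12.0, 30.0, 18.0, 40.0]
events = [1, 1, 0, 1, 1]  # 1 = death observed, 0 = censored

c_full = concordance_index(times, events, risk_scores(expr_rows, coefs))
c_reduced = concordance_index(times, events, risk_scores(expr_rows, reduced_coefs))
print(f"C-index, all genes: {c_full:.3f}; without missing genes: {c_reduced:.3f}")
```

If you held out part of TCGA as a test split, running the comparison there rather than on the training cases should give a less optimistic picture.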

I don't think, however, that removing genes this way is a good strategy in general.

The lasso tag on this question suggests that you used LASSO to define your set of 40 genes out of the approximately 20,000 included in data from TCGA. With so many genes and only a few hundred cases (if you restricted analysis to acute myeloid leukemia, AML), many gene-expression values will be highly inter-correlated. LASSO will choose 1 or a few, at most, from each set of correlated genes. The particular choice will depend on vagaries of the data set rather than on overall population characteristics.

In that situation, each of your 40 genes is likely to represent a large number of other genes with similar expression patterns. If you simply omit some of those 40 genes in future work, you are losing information not just about that gene but also about all of the other genes, not included in your set of 40, that it helps to represent.
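One way to get a feel for what a dropped gene was standing in for is to list the other genes in the training data whose expression tracks it closely. The following is a minimal sketch with invented gene names and values; the 0.8 cutoff is an arbitrary assumption, not a recommendation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_partners(expr_by_gene, target, threshold=0.8):
    """Genes whose |r| with `target` is at least `threshold`, strongest first."""
    hits = []
    for gene, values in expr_by_gene.items():
        if gene == target:
            continue
        r = pearson(expr_by_gene[target], values)
        if abs(r) >= threshold:
            hits.append((gene, r))
    return sorted(hits, key=lambda item: -abs(item[1]))

# Toy expression matrix: gene -> values across the same five patients
expr_by_gene = {
    "GENE_C": [1.0, 2.0, 3.0, 4.0, 5.0],  # selected gene, absent from validation data
    "GENE_X": [1.1, 2.0, 2.9, 4.2, 5.1],  # unselected gene that tracks GENE_C
    "GENE_Y": [5.0, 1.0, 4.0, 2.0, 3.0],  # unselected gene, weakly related
}

partners = correlated_partners(expr_by_gene, "GENE_C")
print([gene for gene, _ in partners])
```

Every gene such a search turns up is information the model implicitly relied on through the selected gene, and it is lost along with that gene if the coefficient is simply dropped.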

Although you might think of your modeling so far as having identified a single model, what you have done might better be considered having evaluated a modeling process: a particular way to use LASSO that provided a useful model in AML. You might consider applying the same modeling process again to the TCGA data, but omitting genes whose expression values aren't available in the newer "beatAML" data set. With the large number of genes whose expression is likely to be correlated with those you removed from the TCGA data, I suspect that you will find similar performance with a new set of around 40 genes in TCGA. Then you can use that model in the test cohort provided by "beatAML."
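The data-preparation half of that suggestion can be sketched as follows: restrict the TCGA expression data to genes that also exist in beatAML before the selection step is rerun. The gene names and values here are placeholders, and the LASSO refit itself is left to whatever software produced the original model.

```python
def shared_genes(tcga_genes, beataml_genes):
    """Genes measured in both cohorts, in a stable sorted order."""
    return sorted(set(tcga_genes) & set(beataml_genes))

def restrict(expr_rows, genes):
    """Keep only the given genes in each patient's expression record."""
    return [{gene: row[gene] for gene in genes} for row in expr_rows]

# Toy stand-ins for the two cohorts (names and values are hypothetical)
tcga_rows = [
    {"GENE_A": 2.1, "GENE_B": 0.4, "GENE_C": 1.0},
    {"GENE_A": 1.5, "GENE_B": 1.2, "GENE_C": 0.2},
]
beataml_genes = ["GENE_A", "GENE_B", "GENE_D"]

candidates = shared_genes(tcga_rows[0].keys(), beataml_genes)
tcga_for_refit = restrict(tcga_rows, candidates)

# tcga_for_refit is what would be passed to the same LASSO procedure as before
print(candidates)
```

Doing the restriction before selection, rather than after, lets LASSO pick correlated substitutes for the genes that beatAML lacks.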

A final comment: sometimes an apparently "missing" gene in a data set is due to different labeling of the same gene in different data sets. Check that simpler explanation first.
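A quick way to run that check is to normalize identifiers on both sides before comparing them. The sketch below assumes Ensembl-style gene IDs with version suffixes and treats anything else as a case-insensitive symbol; the IDs are only illustrative, and real symbol aliases would need a proper annotation table, which this does not attempt.

```python
def normalize_gene_id(gene_id):
    """Strip whitespace, upper-case, and drop the version suffix from Ensembl gene IDs."""
    gene_id = gene_id.strip().upper()
    if gene_id.startswith("ENSG"):
        return gene_id.split(".")[0]  # "ENSG00000141510.17" -> "ENSG00000141510"
    return gene_id

def missing_after_normalizing(model_genes, cohort_genes):
    """Model genes that still have no match in the cohort after normalization."""
    cohort = {normalize_gene_id(g) for g in cohort_genes}
    return [g for g in model_genes if normalize_gene_id(g) not in cohort]

# Illustrative IDs: versioned in the model, unversioned in the validation cohort
model_genes = ["ENSG00000141510.17", "ENSG00000157764.13", "ENSG00000000003.15"]
cohort_genes = ["ENSG00000141510", "ensg00000157764 "]

still_missing = missing_after_normalizing(model_genes, cohort_genes)
print(still_missing)
```

Only the genes left in `still_missing` are candidates for being truly absent rather than differently labeled.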

EdM
  • "A final comment: sometimes an apparently "missing" gene in a data set is due to different labeling of the same gene in different data sets." Yes, I did check that. To resolve it, I changed my genes to Ensembl IDs, and now it's resolved; the 2-3 genes I was missing were ones I had filtered out initially for very low expression counts. – PesKchan Apr 06 '23 at 05:55
  • another request https://stats.stackexchange.com/questions/612178/interpreting-survival-analysis-plot-using-genes-as-predictor – PesKchan Apr 06 '23 at 19:11