I have a sample-by-exon matrix containing a copy number value for each (sample, exon) pair. I would like to generate a second, sample-by-gene matrix in which the exon copy numbers are consolidated into a single value per gene.

Given the nature of copy number alterations, I am not sure whether this is meaningful or even feasible (e.g., how should a gene be summarized when it is affected by two alterations with different copy numbers?).

Is there a common approach that is used in this case?

Edit

A possible example is the following:

         gene1_exon1  gene1_exon2  gene1_exon3 ... geneN_exonM
sample1            2            1            1               0
sample2            2            2            1               0
...
sampleJ            3            2            2               2

A possible summarized matrix (with majority voting - not sure if this would be the best approach):

         gene1 ... geneN
sample1      1         0
sample2      2         0
...
sampleJ      2         2
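
The majority-voting summary above can be sketched in pandas as follows. This is only an illustration, not a recommended method: the function name `summarize_by_majority` is hypothetical, the column names are assumed to follow the `<gene>_<exon>` pattern of the example, only the columns and rows actually shown in the example are used, and ties are broken by taking the lowest copy number, which is an arbitrary choice.

```python
import pandas as pd

# Exon-level matrix from the example (only the columns/rows shown there).
exon_cn = pd.DataFrame(
    {
        "gene1_exon1": [2, 2, 3],
        "gene1_exon2": [1, 2, 2],
        "gene1_exon3": [1, 1, 2],
        "geneN_exonM": [0, 0, 2],
    },
    index=["sample1", "sample2", "sampleJ"],
)


def summarize_by_majority(exon_cn: pd.DataFrame) -> pd.DataFrame:
    """Collapse exon columns to one value per gene by majority vote."""
    # Assumption: the gene name is everything before the last underscore.
    col_to_gene = {col: col.rsplit("_", 1)[0] for col in exon_cn.columns}
    # Transpose so exons are rows, group the exons by gene, and take the
    # most frequent copy number per sample. Series.mode() returns sorted
    # values, so ties resolve to the lowest copy number (arbitrary choice).
    per_gene = exon_cn.T.groupby(col_to_gene).agg(lambda s: s.mode().iloc[0])
    return per_gene.T


gene_cn = summarize_by_majority(exon_cn)
print(gene_cn)
```

On the example data this gives the same gene-level values as the summarized matrix shown above.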

I believe this is more a question of biological interpretation than of implementation (I am already implementing a function in pandas). My concern is the interpretation: my prototype selects, as the gene copy number, the majority copy number among its exons. However, this has some caveats (e.g., if the same gene has a GAIN on one exon and a LOSS on another, do you call GAIN or LOSS?). The big picture is that I want to compare two samples' CNAs at the gene level, but I could not find this task discussed online.

The main objective of this summarization is to compute gene level concordance between samples. Exon-level or base-level concordance is computed in parallel, in order to have different CNA granularities.
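
To make the goal concrete, here is a minimal sketch of one possible gene-level concordance measure. The definition used (fraction of genes with an identical gene-level copy number in both samples) and the function name `gene_concordance` are assumptions for illustration; the post does not define concordance more precisely. The matrix is the summarized example from above.

```python
import pandas as pd

# Gene-level matrix taken from the summarized example in the question.
gene_cn = pd.DataFrame(
    {"gene1": [1, 2, 2], "geneN": [0, 0, 2]},
    index=["sample1", "sample2", "sampleJ"],
)


def gene_concordance(gene_cn: pd.DataFrame, sample_a: str, sample_b: str) -> float:
    """Fraction of genes with the same copy number in both samples.

    Assumption: concordance means exact agreement of the gene-level value.
    """
    same = gene_cn.loc[sample_a] == gene_cn.loc[sample_b]
    return float(same.mean())


print(gene_concordance(gene_cn, "sample1", "sample2"))
```

The same function could be applied to the exon-level matrix to get the finer-grained concordance mentioned above.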

Alternatively, instead of the proposed majority voting, one could call a CNA on a gene if the alteration is found in at least p% of its exons (e.g., 30%). This is just an example; I am not suggesting this approach.
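
For comparison, the threshold idea could look roughly like the sketch below. Several things here are assumptions and not part of the question: a neutral copy number of 2, the labels `GAIN`, `LOSS`, `NEUTRAL` and `MIXED`, the function name `call_genes_by_fraction`, and the choice to report `MIXED` when both gains and losses pass the threshold (one way of not silently picking a side in the GAIN-versus-LOSS case).

```python
import pandas as pd

NEUTRAL_CN = 2  # assumption: diploid baseline

# Exon-level matrix from the example (only the columns/rows shown there).
exon_cn = pd.DataFrame(
    {
        "gene1_exon1": [2, 2, 3],
        "gene1_exon2": [1, 2, 2],
        "gene1_exon3": [1, 1, 2],
        "geneN_exonM": [0, 0, 2],
    },
    index=["sample1", "sample2", "sampleJ"],
)


def call_genes_by_fraction(exon_cn, min_fraction=0.3, neutral=NEUTRAL_CN):
    """Label each (sample, gene) by the fraction of altered exons."""
    # Assumption: the gene name is everything before the last underscore.
    col_to_gene = {col: col.rsplit("_", 1)[0] for col in exon_cn.columns}

    def fraction(flags):
        # Mean of 0/1 flags per gene = fraction of that gene's exons flagged.
        return flags.astype(float).T.groupby(col_to_gene).mean().T

    gain = fraction(exon_cn > neutral) >= min_fraction
    loss = fraction(exon_cn < neutral) >= min_fraction

    calls = pd.DataFrame("NEUTRAL", index=gain.index, columns=gain.columns)
    calls = calls.mask(gain, "GAIN")
    calls = calls.mask(loss, "LOSS")
    # Hypothetical label for genes where both directions pass the threshold.
    calls = calls.mask(gain & loss, "MIXED")
    return calls


calls = call_genes_by_fraction(exon_cn, min_fraction=0.3)
print(calls)
```

Note that with a 30% threshold, `sample2` is called `LOSS` for `gene1` (one of three exons has copy number 1), whereas majority voting gives it copy number 2, so the two summaries can disagree on the same data.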

gringer
gc5
  • You might be interested in the approach we took when trying to find CNVs that overlap known, reported ones. Have a look at our documentation here: https://varsome.com/about/resources/sv-implementation/. I'll try and get the relevant bits into an answer if I get the chance. – terdon Feb 24 '22 at 15:31
  • Thank you! I will look into it – gc5 Feb 24 '22 at 17:31
  • Take it with a pinch of salt, of course: that was not peer reviewed and was just the best system we could come up with based on our own best judgment. That said, I do think it is a reasonable approach, I am just aware that I have been wrong before! :) – terdon Feb 25 '22 at 14:54

0 Answers