
I have a question regarding defining frequently mutated genes in breast cancer cell lines and primary samples.

Basically, I'm having difficulty deciding on a threshold for what to call a 'frequently mutated gene.' For example, I have a list of 40 cell lines with 500 genes, and for each gene a cell line is labelled as either 1 (mutated) or 0 (wild-type). I'd like to set a threshold such that a gene mutated in 4 or more of the 40 cell lines (at least 10%) counts as frequently mutated. However, I'm struggling to find resources that outline how this threshold can be defined, rather than just setting an arbitrary cut-off. My end goal is to generate a list of frequently mutated genes in cell lines and compare this list to the frequently mutated genes in the corresponding primary tissue samples.
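A minimal sketch of the fixed cut-off described above, assuming the data are stored as one list of 0/1 calls per gene. The gene names and calls are made-up toy values, and the cut-off is expressed as a count (4 of 40) to avoid floating-point comparisons.

```python
def mutated_counts(calls_by_gene):
    """Count, for each gene, the number of cell lines labelled 1 (mutated)."""
    return {gene: sum(calls) for gene, calls in calls_by_gene.items()}


def frequently_mutated(calls_by_gene, min_count=4):
    """Return genes mutated in at least `min_count` cell lines, sorted by name."""
    counts = mutated_counts(calls_by_gene)
    return sorted(gene for gene, n in counts.items() if n >= min_count)


# Hypothetical toy data: 3 genes x 40 cell lines (1 = mutated, 0 = wild-type).
toy_calls = {
    "GENE_A": [1] * 4 + [0] * 36,    # 4/40, exactly 10%
    "GENE_B": [1] * 3 + [0] * 37,    # 3/40, below the cut-off
    "GENE_C": [1] * 20 + [0] * 20,   # 20/40
}

frequent = frequently_mutated(toy_calls, min_count=4)
print(frequent)
```

The cut-off itself remains arbitrary here; the sketch only makes explicit what is being thresholded.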

Does anyone know of any resources/papers where something similar has been performed or have any suggestions on how to set a threshold for defining a frequently mutated gene?

Thanks!

2 Answers


One now-standard approach to the problem of finding genes that are frequently mutated among multiple cases of a particular type of tumor is the MutSig system developed at the Broad Institute. The general idea is to determine whether a gene is found to be mutated more frequently in tumor samples than might be expected by chance, rather than setting an arbitrary threshold.

The problem is that mutation rates "expected by chance" can differ substantially among genes depending on their size, replication time during the cell cycle, and so on. The MutSig page has a useful introduction to how thinking about this issue has evolved over the past several years.
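To illustrate why "expected by chance" has to account for gene-specific factors such as size, here is a deliberately simplified sketch. It is not MutSig's algorithm: it assumes a single, hypothetical per-base background mutation rate for all genes and asks how surprising it is to see a gene mutated in at least k of n samples, given only the gene's length. The rate and gene lengths below are illustrative assumptions.

```python
from math import comb


def prob_sample_mutated(per_base_rate, gene_length):
    """Chance that one sample carries >= 1 background mutation in the gene."""
    return 1.0 - (1.0 - per_base_rate) ** gene_length


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def enrichment_p_value(k_mutated, n_samples, per_base_rate, gene_length):
    """P-value for seeing >= k_mutated mutated samples under background alone."""
    p = prob_sample_mutated(per_base_rate, gene_length)
    return binomial_upper_tail(k_mutated, n_samples, p)


background_rate = 2e-6  # hypothetical per-base rate, for illustration only

# The same observation (4 of 40 samples mutated) in a short and a long gene.
short_gene_p = enrichment_p_value(4, 40, background_rate, gene_length=1_500)
long_gene_p = enrichment_p_value(4, 40, background_rate, gene_length=100_000)

print(f"short gene: p = {short_gene_p:.2e}")
print(f"long gene:  p = {long_gene_p:.2f}")
```

Under these assumptions the same 4-of-40 count is very unlikely by chance in the short gene and entirely unremarkable in the long one, which is why a single frequency cut-off applied to all genes can mislead.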

If you actually have (as your question might imply) 40 matched cell-line/primary tumor samples you might be able to answer your question by doing direct paired comparisons. If you have those paired data, you would lose a lot by grouping the results first into cell-line and primary, as your question suggests.

Whichever way you proceed, I'm a bit worried about how you determine that a gene is "mutated" in either the cell lines or the primary samples. Primary tumors typically contain normal cells with normal DNA, diluting the mutant alleles and thus making it harder to detect true mutations. Furthermore, tumors can contain subclones harboring different mutations. Extended passage in culture would be expected to select for the subclone that propagates the fastest under those culture conditions.

So if your primary tumor sample had a mutation detected at, say, only a 10% mutant-allele fraction, would you have counted that as "mutated" or not? Would you have been able to distinguish a 10% mutation fraction generated in error from a true heterozygous mutation in 20% of the cells? Those issues need to be addressed before you go much further down this path; detection of mutations within individual tumors (in comparison to normal-tissue controls) is handled for example by MuTect if you have the raw sequence data.
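The arithmetic behind that example can be sketched as follows. This is not MuTect's model: it assumes a diploid locus, a hypothetical uniform per-read error rate, and a simple binomial model for how many variant reads errors alone would produce. The read depths and error rate are illustrative.

```python
from math import comb


def expected_mutant_allele_fraction(fraction_cells_mutant,
                                    mutant_copies=1, total_copies=2):
    """Expected mutant-allele fraction when only some cells carry the mutation.

    Defaults describe a heterozygous mutation at a diploid locus.
    """
    return fraction_cells_mutant * mutant_copies / total_copies


def prob_alt_reads_from_error(alt_reads, depth, error_rate):
    """P(>= alt_reads variant reads out of `depth`) if only errors occur."""
    return sum(
        comb(depth, i) * error_rate**i * (1.0 - error_rate) ** (depth - i)
        for i in range(alt_reads, depth + 1)
    )


# Heterozygous mutation present in 20% of the sampled cells.
vaf = expected_mutant_allele_fraction(0.20)

# A 10% variant fraction observed at two depths, with an assumed 1% error rate.
p_error_deep = prob_alt_reads_from_error(alt_reads=10, depth=100, error_rate=0.01)
p_error_shallow = prob_alt_reads_from_error(alt_reads=2, depth=20, error_rate=0.01)

print(f"expected allele fraction: {vaf:.2f}")
print(f"10/100 reads by error alone: {p_error_deep:.1e}")
print(f"2/20 reads by error alone:   {p_error_shallow:.3f}")
```

With these assumed numbers, a 10% fraction at depth 100 is hard to explain by error alone, while the same fraction at depth 20 is far less convincing, so whether a low-fraction call counts as "mutated" depends heavily on coverage.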

EDIT in response to further information in comment:

Without having paired comparisons of cell lines and primary tumors, one way to proceed would be to determine whether there are differences in the prevalence of specific mutations between the panel of cell lines and an appropriate panel of primary tumors. This could be done with standard contingency-table approaches (e.g., Fisher exact test), with correction for multiple hypothesis testing. This would remove the need to define "frequently mutated gene" first.
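A minimal sketch of that contingency-table approach, using only the Python standard library: a two-sided Fisher exact test per gene, followed by Benjamini-Hochberg adjustment as one common choice of multiple-testing correction (the choice of correction is an assumption here). The gene names and counts are made up for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: (mutant lines, wild-type lines, mutant tumors, wild-type tumors)
counts = {
    "GENE_A": (20, 20, 50, 450),   # 50% of cell lines vs 10% of tumors
    "GENE_B": (4, 36, 50, 450),    # 10% in both panels
}

genes = list(counts)
raw_p = [fisher_exact_two_sided(*counts[g]) for g in genes]
adjusted_p = dict(zip(genes, benjamini_hochberg(raw_p)))

for gene in genes:
    print(gene, f"adjusted p = {adjusted_p[gene]:.3g}")
```

Each gene contributes one 2x2 table (mutant versus wild-type, cell lines versus tumors), so no frequency threshold is needed before testing.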

In the case of breast cancer, however, previous work on the subject matter indicates that prevalence of specific mutations can depend on subtypes of breast cancer defined by patterns of gene expression. So in this case a proper comparison would first involve separating the cell lines into these subtypes, and comparing each cell-line subtype against its corresponding primary-tumor subtype. With only 40 cell lines, there might not be much power for detecting any differences.

EdM
  • Hi EdM, thanks for your input. I should clarify that with the breast cancer cell lines I have, I'd like to be able to compare the list of frequently mutated genes I obtain with that of primary breast tumour samples (not matched) that can be found in public databases like cBioPortal. I did find a paper where MutSig was used to find significantly mutated genes, but this list also included genes with low mutation frequency. Can something similar be used to set a frequency threshold for the cell line data? – tolo9397 Aug 17 '15 at 00:06
  • You may have a problem here. Even within individual expression subtypes of breast cancer, only 4 genes were mutated in more than 10% of samples in the TCGA data. In your set of 40 cell lines, that 10% value would mean only 4 lines with a mutation. So you might be able to find genes with much greater mutation frequencies than in published data, but not the other way around. The TCGA data also reminded me that you need to take breast cancer expression subtypes into account to do this properly. – EdM Aug 17 '15 at 01:28
  • You mentioned taking expression subtypes into account in order to define a frequency threshold properly. What do you mean by this? For example, should I first group the cell lines into their subtypes (e.g. triple negative, HER2+ etc.), then assess the frequency of a particular mutation (e.g. mutation status of P53) and compare that to the percentages in the paper you provided? – tolo9397 Aug 17 '15 at 21:02
  • Exactly, but don't bother with a threshold. Just examine whether, for example, TP53 is more frequently mutated in HER2-enriched cell lines than it is in HER2-enriched primary tumors via Fisher's exact test. That test just needs the numbers of TP53 mutant versus wild-type in your HER2-enriched cell lines and the corresponding numbers from the paper for the tumors. Note that I've edited my answer with some of this information. – EdM Aug 17 '15 at 21:19

You can try changepoint or threshold analysis on the frequency distribution of mutations per gene. See approaches such as piecewise regression (and the more general review by Schwarz 2015), or the changepoint vignette (Killick and Eckley 2013). Most have been used in genetics but are not specific to it (e.g. this more complex multiple-changepoint example). Simply plotting the number of mutations per gene and looking for obvious breakpoints has also been used in genetics, but would not qualify as a strong approach here.
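As a rough illustration of the idea, here is a minimal single-changepoint sketch: sort the per-gene mutation counts and pick the split that minimizes the total within-segment sum of squared deviations from the segment means. This is a simple least-squares stand-in, not the method implemented in the changepoint package, and the counts are made-up toy values.

```python
def segment_sse(segment):
    """Sum of squared deviations of a segment from its own mean."""
    mean = sum(segment) / len(segment)
    return sum((v - mean) ** 2 for v in segment)


def best_single_changepoint(values):
    """Return k such that values[:k] and values[k:] minimize the total SSE.

    Always returns some split (1 <= k <= len(values) - 1), even when the
    data contain no real break, so the result needs a sanity check.
    """
    best_k, best_cost = None, float("inf")
    for k in range(1, len(values)):
        cost = segment_sse(values[:k]) + segment_sse(values[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


# Hypothetical number of mutated cell lines per gene, sorted high to low.
counts_per_gene = sorted([22, 20, 19, 3, 2, 2, 1, 1, 1, 0, 0, 0], reverse=True)

k = best_single_changepoint(counts_per_gene)
high_group = counts_per_gene[:k]
print("split after position", k, "-> high group:", high_group)
```

Because this sketch always reports a split, a real analysis would need a penalty or significance test to decide whether any changepoint exists at all, which is what the dedicated methods cited above provide.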

However, it is not clear how reducing this to a classification problem (labeling genes as frequently mutated vs. not) would benefit your end goal as formulated above, as opposed to comparing the numbers of mutations per gene for cell lines vs. primary tissue.

katya