4

I have Illumina MiSeq paired-end reads from 150bp amplicons mapped to my reference genome (> 1000X coverage). These reads have indels that may or may not induce frameshifts. If the indel induces a frameshift (i.e. indel is not a multiple of 3bp), I want to know if that leads to a stop codon later in the read. Is there any package/method that would allow me to do this?

Few extra details

  • I might be wrong but I think this is different than gene prediction as I would want the tool to scan each read in the correct frame; not try different frames to detect ORFs etc.
  • I expect mosaicism, so I do not want to compute a single assembly/consensus and predict stop codons in it.

Here is how I imagine such a script/tool would work:

  1. Take the position of the first translated codon of the exon I am interested in (from Ensembl for instance?). Eg. gene abc/exon1 starts at chr1:1000.

Then for each read:

  1. Look at which position the read is mapped to my reference. Eg. read 1 is mapped to chr1:1050.
  2. Compute the position of the first full codon in the correct frame in my read. Eg. start codon at chr1:1000 but my read starts at chr1:1050, the first full translated codon in my read should start at chr1:1052 (I think?).
  3. Split all my read into 3-mers starting at that position and build something like a 3-mer/position dictionnary.
  4. Look for stop codons and report the positions.

Does it exist?

francoiskroll
  • 221
  • 1
  • 3
  • 1
    Could you give us a bit more context? Is this targeted sequencing of a eukaryote? Or whole genome bacterial sequencing? So as input you have a reference genome (FASTA), annotated/predicted genes (GTF/GFF3), and your NGS reads (FASTQ/BAM)? And you'd like to predict stop codons based on the read mappings? Do you have reason to believe the annotated stop codons are incorrect? Are you looking for nonsense mutations? – Daniel Standage Mar 27 '19 at 12:57
  • @DanielStandage It is targeted sequencing of small (150bp) amplicons from zebrafish genomic DNA. I am sequencing after inducing mutations, i.e. my reads contain various indels. This is why I need to do this for each read in my alignment. I am looking for nonsense mutations, but not only. Eg. there is a 2bp deletion early in the read, I want to know if there will be a stop codon downstream as a result when the RNA gets translated. – francoiskroll Mar 27 '19 at 14:08
  • 1

2 Answers2

1

With 150bp reads, you might be able to blastx all the unique sequences you've got against the target proteins.

swbarnes2
  • 1,925
  • 7
  • 6
0

I just recently discovered the ampliCan R package from the Valen lab after asking this question. I highly recommend it. It does all of what I was expecting & more.

To install: https://bioconductor.org/packages/release/bioc/html/amplican.html

To run (& other explanations): https://rdrr.io/bioc/amplican/f/vignettes/amplicanOverview.Rmd

This is the original paper & Do not forget to credit the people! https://genome.cshlp.org/content/early/2019/03/08/gr.244293.118.abstract

francoiskroll
  • 221
  • 1
  • 3