4

My aim is to parse an experimental transcript set (obtained by RNAseq) to check which splice junctions are already reported in a reference annotation and which ones are new.

I tried Regtools:

  1. feed the experimental alignment file (BAM: aligned transcripts to reference genome) to Regtools junctions extract
  2. use Regtools junctions annotate to compare with the reference annotation (GTF)

Exact commands I used for this are:

  1. [..]/regtools/build/regtools junctions extract aln_sorted.bam -o regtools.bed - where the sorted BAM file has index in the same folder
  2. [..]/regtools/build/regtools junctions annotate regtools.bed [..]/ref/genome.fa [..]/ref/annotation.gtf -o regtools_annotate_out

The alignments have been generated with Minimap2.

It did not behave as expected: this gives me as output a file with only junctions annotated as 'new' ($11==N), while other annotations should be possible according to the manual:

DA - This exon-exon junction is present in the transcriptome provided by the user(GTF) The ends of this junction are hence known donor and known acceptor sites according to the GTF file.

NDA - This exon-exon junction is not present(novel) in the transcriptome provided by the user(GTF) The ends of this junction are known donor and known acceptor sites according to the GTF file.

D - This exon-exon junction is not present(novel) in the transcriptome provided by the user(GTF) The donor of this junction is a known donor but the acceptor is novel.

A - This exon-exon junction is not present(novel) in the transcriptome provided by the user(GTF) The acceptor of this junction is a known acceptor but the donor is novel.

N - This exon-exon junction is not present(novel) in the transcriptome provided by the user(GTF) The ends of this junction are hence not known donor/acceptor sites according to the GTF file.

My questions are:

  • am I doing something wrong?
  • do you have any experience with Regtools and can help debugging this behavior?
  • do you have other suggestion to achieve the output I am seeking?
aechchiki
  • 2,676
  • 11
  • 34

2 Answers2

4

You can use the script that comes with minimap2:

# install the k8 javascript shell
curl -L https://github.com/attractivechaos/k8/releases/download/v0.2.4/k8-0.2.4.tar.bz2 | tar -jxf -
cp k8-0.2.4/k8-`uname -s` $HOME/bin/k8    # assuming $HOME/bin in your $PATH

# acquire the latest minimap2
git clone https://github.com/lh3/minimap2.git

# print novel/erroneous junctions
minimap2/misc/paftools.js junceval -e anno.gtf rna-seq.sam > output.txt

Some example output:

P       read1    4       chr2    227332588       227340301       [(227332588,227340291)]
P       read2    1       chr9    19376379        19376489        [(19376388,19376493)]
P       read2    3       chr9    19378506        19378707        [(19378514,19378707)]
N       read3    1       chr1    160647335       160647472
N       read3    2       chr1    160647581       160659672

Here column 4–6 gives junctions not in the annotation. Column 7, if not empty, gives the list of annotated junctions/introns that overlap the junction in read alignment. Note that a big fraction of outputted records are errors in read alignment.

user172818
  • 6,515
  • 2
  • 13
  • 29
  • Hi @user172818, I have two questions on this: (1) could you please explain what the value in $3 represent exactly - and (2) I had cases where the value in $7 is a list of more than one tuple: do you have an explanation ? – aechchiki Feb 27 '18 at 10:54
  • @aechchiki 1) the exon number of the read. 2) an intron on a read may overlap multiple distinct introns in annotations due to alt splicing. The list gives all. – user172818 Feb 27 '18 at 12:49
1

For future googlers, Ding et al., 2017 published a review: "Comparison of Alternative Splicing Junction Detection Tools Using RNA-Seq Data".

Table 1 has the list of software in this review.

aechchiki
  • 2,676
  • 11
  • 34