
I recently downloaded a publicly available dataset of 350 tumor samples. The published paper gives the following information.

[Image: screenshot from the published paper]

They used Ribo Zero Gold, so rRNA was depleted, and the data are strand-specific. After aligning the data, I ran alignment quality checks with the Qualimap RNA-Seq QC tool and visualised the BAM files in IGV. The alignment looks good: all samples showed a 90% alignment rate. In all samples, the highest percentage of mapped reads originated in intronic regions, followed by exonic and intergenic regions.

I have seen a post here, "Reads mapped to exonic, intronic and intergenic regions", which says that a high proportion of intronic reads could be due to contamination. I googled higher read counts in intronic regions and found a paper, "Evaluation of two main RNA-seq approaches for gene quantification in clinical RNA sequencing: polyA+ selection versus rRNA depletion", and some other links, such as "RIBO-DEPLETION IN RNA-SEQ – WHICH RIBOSOMAL RNA DEPLETION METHOD WORKS BEST?", which report more intronic reads with the rRNA depletion protocol.

This RNA-SEQ tutorial also mentions that a higher intronic mapping rate is expected for rRNA removal compared to polyA selection.

So, my question:

I am working with lncRNAs, so I'm using samples prepared with the rRNA depletion protocol. Is this higher intronic rate common in rRNA-depleted datasets, or do I have to check anything else before proceeding with these samples?

maven
  • Hi @maven .. cool username. I'm not really sure this is a bioinformatics question. – M__ Oct 01 '20 at 22:48

2 Answers


It's not so much that you have "intronic contamination" or "genomic contamination"; rather, with rRNA depletion you're not explicitly selecting for full-length mature transcripts. That is the most common cause of higher intronic read rates. There's nothing you can do about this post hoc, so just continue along.

BTW, many lncRNAs are polyadenylated, so you'll keep them with poly-A selection.

Devon Ryan
  • Thanks for the answer Devon. I saw this - Intronic and intergenic mapped reads could be long non coding RNAs, most of them are unannotated [https://www.researchgate.net/post/What_are_the_causes_of_reads_mapped_to_intergenic_region_in_RNA-seq]. My approach is to look for novel lncRNAs. – maven Oct 02 '20 at 08:14
  • Yes, it's a non-trivial problem if you can't filter out immature sequences before library prep. – Devon Ryan Oct 02 '20 at 08:32
  • Just to make sure, I also checked for rRNA contamination, and the percentage is very low (0.66% of reads mapped). The flagstat output: 12408836 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 81734 + 0 mapped (0.66% : N/A) 0 + 0 paired in sequencing 0 + 0 read1 0 + 0 read2 0 + 0 properly paired (N/A : N/A) 0 + 0 with itself and mate mapped 0 + 0 singletons (N/A : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) – maven Oct 02 '20 at 10:00
  • The read counts I extracted using featureCounts are low compared to the total number of alignments. Is this because the read counts come only from the exonic regions? – maven Oct 02 '20 at 13:09
  • Yes, featureCounts ignores intronic reads by default. – Devon Ryan Oct 02 '20 at 13:38
  • Are there any tools to check DNA contamination? – maven Oct 14 '20 at 15:51
  • Probably, you’d want to see how much intergenic expression you have, though you’d need some baseline for comparison. – Devon Ryan Oct 14 '20 at 15:54
  • For all 350 samples I downloaded, I see 15-30% exonic reads, 70-80% intronic reads and 5-10% intergenic reads. So, what percentage of intergenic reads would indicate DNA contamination? – maven Oct 14 '20 at 17:22
  • You’d need to mine the literature, but those aren't unreasonable levels given that most of it is probably rRNA and tRNA. – Devon Ryan Oct 14 '20 at 19:14
  • Hi Devon, how do I filter out rRNA from the data? And you said the higher number of intronic reads could come from immature transcripts. Is it possible to use such data for novel lncRNA detection? – maven Oct 19 '20 at 12:14
  • It would be very difficult to use this data for lncRNA detection if you can't be relatively certain that it's high quality. You're not going to have an easy go of it. Regarding rRNA exclusion, blacklist very high expression regions that end up having tRNA or rRNA homology. – Devon Ryan Oct 19 '20 at 14:42
  • I have a small, basic question: how do I get the sequencing depth of the dataset I downloaded? – maven Oct 20 '20 at 09:20
  • Count the number of reads and look at their length distribution. FastQC can aid in this. – Devon Ryan Oct 20 '20 at 15:18
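The last suggestion above (count the reads and look at their length distribution) can be sketched in a few lines of Python. This is only a minimal illustration, not what FastQC does internally: it assumes plain four-line FASTQ records (no wrapped sequences), and the function names `read_length_distribution` and `summarize` are made up for this example.

```python
import gzip
from collections import Counter


def open_maybe_gzip(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_length_distribution(path):
    """Return a Counter mapping read length -> number of reads.

    Assumes strict 4-line FASTQ records, so the sequence is always
    the second line of each record.
    """
    lengths = Counter()
    with open_maybe_gzip(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # sequence line of the record
                lengths[len(line.rstrip("\r\n"))] += 1
    return lengths


def summarize(lengths):
    """Return (number of reads, total number of sequenced bases)."""
    n_reads = sum(lengths.values())
    n_bases = sum(length * count for length, count in lengths.items())
    return n_reads, n_bases
```

For a hypothetical file `sample_R1.fastq.gz`, `summarize(read_length_distribution("sample_R1.fastq.gz"))` would give the read count and the total number of bases for that file; for paired-end data both mate files would need to be counted.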

I can't see how intron contamination is linked to rRNA depletion.

The only reason the number of contaminants would appear to have increased is that, after rRNA removal, the proportion of intronic contaminants has increased relative to the total remaining RNA content. However, the absolute number of contaminants remains exactly the same before and after rRNA removal. By the same token, the RNA of interest, the lncRNAs, will also have increased proportionally, so you get better depth on their abundance and diversity.

That's just life; perhaps just filter this bioinformatically.
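The proportion argument above can be checked with a toy calculation. The molecule counts below are invented purely for illustration (they are not taken from the dataset in the question), and the depletion is idealized as removing all rRNA and nothing else.

```python
def fractions(counts):
    """Convert absolute counts per category into fractions of the total."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Hypothetical composition of a total-RNA sample (made-up numbers).
before = {"rRNA": 8000, "exonic": 1000, "intronic": 800, "intergenic": 200}

# Idealized rRNA depletion: drop rRNA, leave every other count untouched.
after = {category: n for category, n in before.items() if category != "rRNA"}

print(before["intronic"], after["intronic"])  # absolute count is unchanged
print(fractions(before)["intronic"])          # intronic fraction before depletion
print(fractions(after)["intronic"])           # intronic fraction after depletion
```

With these numbers the intronic count stays at 800 while its share of the library rises from 8% to 40%. This only illustrates the change in proportions; it does not address the comparison with poly-A selection raised in the other answer.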

M__
  • Hi @Michael, thanks for the answer. You mean I should remove this rRNA contamination and then proceed? How do I do that? Any tips, please? – maven Oct 02 '20 at 06:25