3

I would be grateful if someone could take a quick look at these FastQC results. This is paired-end RNA-seq data. According to the FastQC manual, an unusually shaped distribution suggests contamination, and a shifted curve suggests a systematic bias. The GC content distributions look strange both pre-alignment and post-alignment.

The samples are paired-end and strand-specific, and more than 95% of reads map for every sample. There is also no adapter content.

What could be the problem?

[Two images: GC content distribution plots]

beginner
  • 631
  • 7
  • 15
  • 1
    What are you trying to sequence? Something more specific than "poly-A amplified RNA" would be helpful to know if this is unexpected. – gringer Feb 18 '18 at 23:38
  • Following on from gringer: what species are you sequencing? Is there any background on the GC distribution expected for this type of sequencing? Have you checked whether previous runs show the same problem? (It could be a thermophilic microorganism, in which case this might be totally normal, or not...) – llrs Feb 19 '18 at 08:38
  • It's human. Apart from the per-sequence GC content and duplication levels, all the other plots are fine for every sample. There is also no adapter content. The samples were sequenced strand-specific with poly-A selection. More than 95% of reads map for all samples. The average GC content of the paired-end reads is 50% or below. The proportion of duplicate reads is 65-70%. – beginner Feb 19 '18 at 09:15
  • I also see sequence bias at the 5' and 3' ends in the per-base sequence content plot. Could this GC content distribution be due to biased selection? – beginner Feb 19 '18 at 10:11
  • What kit have you used to generate your libraries? – Valentina Discepolo Apr 18 '18 at 17:20
  • Ribosomal RNA depletion kit – beginner Apr 19 '18 at 08:57
  • Okay, so ribosomal depletion kits normally produce total RNA rather than polyA-selected RNA. Normally you would do both ribo depletion and polyA selection. If you did ribo depletion rather than polyA selection, I would suggest you might be looking at contamination with ribosomal sequences. That would explain why you still see high mapping rates. – Ian Sudbery Apr 19 '18 at 09:22
  • I'm not sure how normal it is for people to combine ribo depletion and polyA selection; doing one obviates doing the other. However, it's worth noting that ribo depletion is hit or miss regarding how well it actually works. I've seen cases with 1-2% rRNA afterward and cases with 20-30% rRNA; it depends a lot on how fresh the samples are. – Devon Ryan Apr 19 '18 at 09:25
  • @IanSudbery So is this common with ribo depletion kits? There is also no adapter content. Can I proceed with these samples? – beginner Apr 19 '18 at 13:43
  • Any kit can work inefficiently. You could check this by looking at how many reads map to ribosomal genes or ribosomal RNAs. It's not necessarily a problem, other than that you might end up with low counts in non-ribosomal genes. I would also be very careful when it comes to normalisation in a DE experiment: large amounts of sequencing estate taken by a small number of genes can throw off normalisation for other genes. – Ian Sudbery Apr 20 '18 at 07:41
  • Yes, when I got the counts from the BAM files after aligning to the genome, I saw that most of the samples have approx. 25k-40k genes with 0 counts. Is this due to the reason you gave? How can I check for reads mapping to ribosomal genes or rRNAs? – beginner Apr 20 '18 at 07:55
  • @IanSudbery Hi, could you please check my comment above? And could you please tell me how to check for reads mapping to ribosomal genes or rRNAs? – beginner Apr 23 '18 at 08:39
  • 1
    Sorry, our mapping pipeline does all this automatically and I had to dig out the code to remind myself how it's done.

    If you think you've got a very skewed counts distribution, I'd start by looking at the identities of the genes with the highest counts.

    Alternatively, to look directly at rRNA genes, you can download their annotations from UCSC or BioMart with the correct biotypes. For the ribosomal proteins, use GO term GO:0003735. Then quantify with whatever you normally quantify with.

    % reads mapping to protein coding genes is generally a pretty good QC metric.

    – Ian Sudbery Apr 23 '18 at 09:43
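The checks suggested in the comments above (look at the highest-count genes, and at the share of reads going to rRNA versus protein-coding genes) could be sketched roughly as follows in Python. This assumes you already have a per-gene count table and a gene-to-biotype lookup, for example exported from BioMart. The function name `summarise_counts`, the gene IDs, and the numbers are all made up for illustration.

```python
from collections import Counter


def summarise_counts(counts, biotypes, top_n=3):
    """Summarise a gene count table.

    counts   : dict mapping gene ID -> read count (hypothetical input)
    biotypes : dict mapping gene ID -> biotype string, e.g. "rRNA"
    Returns (fraction of counts per biotype, top_n genes by count,
    number of genes with zero counts).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads were assigned to any gene")

    per_biotype = Counter()
    for gene, n in counts.items():
        # Genes missing from the lookup are reported as "unknown".
        per_biotype[biotypes.get(gene, "unknown")] += n
    fractions = {b: n / total for b, n in per_biotype.items()}

    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    zero_genes = sum(1 for n in counts.values() if n == 0)
    return fractions, top, zero_genes


# Toy data: gene IDs and counts are invented, not taken from the question.
toy_counts = {"gene_rRNA_1": 600, "gene_A": 200, "gene_B": 150,
              "gene_C": 50, "gene_D": 0}
toy_biotypes = {"gene_rRNA_1": "rRNA", "gene_A": "protein_coding",
                "gene_B": "protein_coding", "gene_C": "protein_coding",
                "gene_D": "protein_coding"}

fractions, top, zero_genes = summarise_counts(toy_counts, toy_biotypes)
for biotype, frac in sorted(fractions.items()):
    print(f"{biotype}: {frac:.1%} of assigned reads")
print("highest-count genes:", top)
print("genes with zero counts:", zero_genes)
```

If a handful of rRNA genes dominate the total, that would be consistent with inefficient depletion; the same per-biotype fractions also give the "% reads mapping to protein coding genes" QC metric mentioned above.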

1 Answer

3

Although I generally recommend that people be wary of the warnings that FastQC gives (it tends to be overly paranoid), the GC content graphs here do look odd. It's good that you've had a look at it with another program to confirm the observation. They should have a distribution that is fairly close to normal, a bit like this:

GC content from mouse cDNA
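To check the shape outside FastQC, a rough sketch like the one below computes per-read GC content from FASTQ records and compares the binned counts with a normal curve that has the same mean and standard deviation. This is only an approximation of the idea and is not claimed to reproduce FastQC's own calculation. The function names and the three toy reads are hypothetical, and `bin_width` is assumed to divide 100 evenly.

```python
from statistics import NormalDist, mean, pstdev


def gc_percent(seq):
    """GC percentage of one read, ignoring N and other non-ACGT symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        return None  # read has no called bases
    return 100.0 * gc / (gc + at)


def fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def gc_profile(seqs, bin_width=10):
    """Return (observed bin counts, expected bin counts, mean, sd).

    Expected counts come from a normal distribution fitted to the
    observed mean and sd; they are None when all reads share one GC value.
    """
    values = [v for v in (gc_percent(s) for s in seqs) if v is not None]
    if not values:
        raise ValueError("no reads with called bases")
    n_bins = 100 // bin_width
    observed = [0] * n_bins
    for v in values:
        # A read with 100% GC goes into the last bin.
        observed[min(int(v // bin_width), n_bins - 1)] += 1
    mu, sigma = mean(values), pstdev(values)
    expected = None
    if sigma > 0:
        dist = NormalDist(mu, sigma)
        expected = [
            len(values) * (dist.cdf((i + 1) * bin_width) - dist.cdf(i * bin_width))
            for i in range(n_bins)
        ]
    return observed, expected, mu, sigma


# Toy FASTQ content (invented reads); with a real file you could pass an
# open text file handle to fastq_sequences instead.
toy_fastq = ["@r1", "GGCC", "+", "IIII",
             "@r2", "ATAT", "+", "IIII",
             "@r3", "ACGT", "+", "IIII"]
observed, expected, mu, sigma = gc_profile(fastq_sequences(toy_fastq))
print("observed counts per 10% GC bin:", observed)
print(f"mean GC = {mu:.1f}%, sd = {sigma:.1f}")
```

A second peak or a heavy shoulder in the observed counts, relative to the expected ones, is the kind of departure from a near-normal shape described above.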

Without more information about your sample, I can't tell whether or not this is expected from your input. This looks like fairly substantial contamination, but could also be caused by biased selection during sample preparation (e.g. low quality sample, low yield, PCR amplification bias), or just looking at a multi-organism sample rather than a single-organism sample.

As a contamination check, I'd recommend running your RNA-seq reads through a k-mer-based classification tool (e.g. OneCodex, Centrifuge, BBSketch, or Sourmash). This should tell you the most likely place that contaminant sequences came from.
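For intuition only, here is a toy Python sketch of the general idea behind k-mer-based assignment: each read is given to the candidate source whose k-mer set it overlaps most. It does not reproduce how any of the tools named above work internally (they use indexed databases and more refined methods), it ignores reverse complements, and the function names, source labels, and sequences are invented.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_read(read, references, k=4):
    """Assign a read to the reference sharing the largest fraction of its k-mers.

    references : dict mapping a label -> set of k-mers for that source
    Returns (label, fraction of read k-mers found), or (None, 0.0) when
    nothing matches or the read is shorter than k.
    """
    read_kmers = kmers(read, k)
    if not read_kmers:
        return None, 0.0
    best_label, best_score = None, 0.0
    for label, ref_kmers in references.items():
        score = len(read_kmers & ref_kmers) / len(read_kmers)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Invented reference sequences standing in for two candidate sources.
toy_refs = {
    "source_A": kmers("ACGTACGTAGCTAGCTAG", 4),
    "source_B": kmers("TTTTGGGGCCCCAAAATTTT", 4),
}

for read in ["GGGGCCCC", "ACGTACGT", "NNNNNNNN"]:
    print(read, "->", classify_read(read, toy_refs))
```

Real data needs a much larger k and a proper reference database, which is why a dedicated tool is the practical choice.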

gringer
  • 14,012
  • 5
  • 23
  • 79