I have been running nf-core in Python and it works great! But I have a seemingly simple question that I'm struggling to find an answer to online. After running the nf-core pipeline on my RNA sequencing data, Salmon provides gene counts in a file named salmon.merged.gene_counts.tsv. This file has essentially all the information I am seeking, with every count associated with a unique gene_id. However, for the many gene_ids that have no known gene_name, I would like to find the corresponding DNA sequence (or the corresponding transcript's RNA sequence). So my question is: where can I find the information needed to map a gene_id to the nucleotide sequence of the gene (or of its original transcript)?

(Apologies in advance if my question is missing information or is overly simple and I'm overlooking something obvious; I am quite new to all this.)
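
To make the kind of mapping being asked about concrete, here is a minimal Python sketch. It rests on several assumptions that are not confirmed in this thread: that the gene_ids in salmon.merged.gene_counts.tsv are the same gene_id values found in the GTF given to the pipeline, that the GTF contains `gene` feature rows, and that the contig names in column 1 of the GTF match the FASTA headers. The function names are hypothetical, and the whole genome is held in memory, which may not suit every machine.

```python
import re

# Complement table used when a gene lies on the minus strand.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_gtf_genes(gtf_lines):
    """Map gene_id -> (contig, start, end, strand) using GTF 'gene' rows."""
    genes = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if match:
            genes[match.group(1)] = (
                fields[0], int(fields[3]), int(fields[4]), fields[6]
            )
    return genes


def read_fasta(fasta_lines):
    """Read FASTA records into a dict: first word of the header -> sequence."""
    chunks = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def gene_sequence(gene_id, genes, genome):
    """Return the genomic sequence of a gene, reverse-complemented if on '-'."""
    contig, start, end, strand = genes[gene_id]
    # GTF coordinates are 1-based and inclusive.
    seq = genome[contig][start - 1:end]
    if strand == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq
```

With real files, the line iterables would come from opening the GTF and the genome FASTA (for example `open("annotation.gtf")`, a placeholder name). The result is the full genomic span of the gene including introns; a transcript sequence would require joining the exon rows instead.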

user18959
  • Which species? Also: can you provide any examples of gene_ids you have problems with? – darked89 Feb 15 '24 at 12:18
  • Anas platyrhynchos (mallard). I'm actually trying to solve the inverse problem and check whether TP53 is present as one of the gene_ids. Looking at related data from https://midb.pnb.uconn.edu/, this gene seems to be missing there as well – user18959 Feb 15 '24 at 12:33
  • re TP53: if one searches for the p53 gene here: https://www.ensembl.org/Anas_platyrhynchos/Info/Index I do not see TP53 itself, just the genes binding to it, or downstream targets. If the genome is incomplete/full of gaps then TP53 may be missing from the annotation. I guess not just in ENSEMBL but in other genome-assembly-based annotations as well. – darked89 Feb 15 '24 at 12:59
  • If you look at the orthologues of human TP53 (ENSEMBL), you can find them in chicken, Kakapo, Zebra finch plus a bunch of reptiles. While this is not definitive proof that TP53 exists in your duck, I would expect that some duck RNA-Seq may contain at least fragments of TP53. – darked89 Feb 15 '24 at 13:26
  • Thank you for sharing these thoughts, @darked89. I suppose my simple question is: what is the best way to check the duck RNAseq data for any fragments of TP53? Any specific insights on methods and files to utilize would be helpful! – user18959 Feb 15 '24 at 23:29
  • Additionally, out of curiosity, how is it that the TP53 gene is available on NCBI but unavailable in Ensembl's automated curation pipeline? – user18959 Feb 16 '24 at 04:05
  • re ENSEMBL mallard TP53: from NCBI there is a link to ENSEMBL rapid. Could it be that there are two mallard assemblies, or at least two versions? – darked89 Feb 16 '24 at 08:35

1 Answer

Fishing out TP53 RNA-Seq reads

  • go to the NCBI Taxonomy Browser
  • enter Anas platyrhynchos as the search term
  • click on Anas platyrhynchos
  • on the right you will have a table with SRA Experiments; click on the count (11,679)
  • restrict to RNA and paired on the left
  • send the results to Run Selector
  • order them by Bases
  • select a few of the top runs, like SRR16082819

What to do next depends on your exact goal. These are separate approaches; you may:

  • get a mallard transcriptome that is as complete as possible, then use kallisto just to check that a given pair of FASTQ files contains a bunch of reads derived from TP53
  • use STAR to align the reads to a genome assembly that includes the contig containing the TP53 gene
  • try to fish the reads out of the FASTQ files using the TP53 cDNA sequence, but this will likely miss, e.g., putative alternative 5-prime and 3-prime parts of the gene
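
The last approach can be sketched in plain Python as exact k-mer matching. This is only an illustration under stated assumptions, not the method of any particular tool: the function names are hypothetical, the value of `k` is yours to choose, FASTQ records are assumed to be the usual four lines each, and reads whose k-mers all differ from the cDNA by a mismatch will be missed.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_kmers(reference, k):
    """Collect every k-mer of the reference and of its reverse complement."""
    forward = reference.upper()
    kmers = set()
    for strand in (forward, reverse_complement(forward)):
        for i in range(len(strand) - k + 1):
            kmers.add(strand[i:i + k])
    return kmers


def iter_fastq(lines):
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    iterator = iter(lines)
    for header in iterator:
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = next(iterator, "")
        next(iterator, "")  # the '+' separator line
        quality = next(iterator, "")
        yield header.strip(), sequence.strip(), quality.strip()


def fish_reads(fastq_lines, kmers, k, min_hits=1):
    """Yield reads sharing at least min_hits exact k-mers with the reference."""
    for header, sequence, quality in iter_fastq(fastq_lines):
        read = sequence.upper()
        hits = sum(
            1 for i in range(len(read) - k + 1) if read[i:i + k] in kmers
        )
        if hits >= min_hits:
            yield header, sequence, quality
```

For paired-end data you would normally keep both mates whenever either one matches. Because only k-mers present in the supplied cDNA are searched, the caveat about alternative 5-prime and 3-prime parts applies here too.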
darked89
  • Thank you for this insightful response! For the last part, could you add more details about this step of alignment against the genome? Namely, how exactly can I confirm the contig containing the TP53 gene is present and identified using STAR? – user18959 Feb 17 '24 at 15:06
  • It is contig NW_024009712. You will have to get both the genome FASTA and the GTF annotation from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_015476345.1/ – darked89 Feb 17 '24 at 16:55
  • You may also try to align the above contig to the probably more complete chicken genome, to get an idea of how good the annotation of the p53 gene in mallard is and whether anything is missing. – darked89 Feb 17 '24 at 16:59
  • Thank you! For clarity, how did you find that contig? Also, strangely the genome fasta file seems to be corrupted in some way (but other files work, e.g. GTF annotation). Is this genome fasta file hosted anywhere else by chance? – user18959 Feb 20 '24 at 23:25
  • I did find this updated version (from 2023-12-14) of the genome here, but is there any way to confirm that the GFF/GTF file includes TP53? – user18959 Feb 21 '24 at 09:40
  • download the annotation then: zcat GCF_015476345.1_ZJU1.0_genomic.gff.gz | grep "gene=TP53;" | more – darked89 Feb 21 '24 at 09:54
  • That works great, thank you! Going to test this all out now – user18959 Feb 21 '24 at 09:58
  • I tried converting the GFF to GTF using this method but keep hitting: KeyError: 'id-LOC110351722' (perhaps because the "gene_id" key is not present among the key-value pairs of the .gtf file). I'm trying to resolve it, but if you have any thoughts, I'd certainly be glad to hear them – user18959 Feb 21 '24 at 10:15
  • re conversion gff to gtf: this is a separate question. BTW, why do you need it? – darked89 Feb 21 '24 at 10:32
  • It's part of the nf-core pipeline: it expects annotations in GTF format, but the above link only provides GFF files. I tried using gff3_ID_generator.py to generate IDs in case the gff file does not have them for every feature, but the error persists – user18959 Feb 21 '24 at 10:42
  • I think I've found the solution here where the larger repo of files are available – user18959 Feb 21 '24 at 11:01
  • Ok. Still, please post a separate question about the gff to gtf conversion of that particular gff file. And it would be helpful if you upvoted the answer and accepted it as correct – darked89 Feb 21 '24 at 11:07
  • Yep, just running and testing the pipeline with the updated fasta and gtf files to determine if TP53 is ultimately found with the nf-core pipeline (which I expect it should) — but these new files keep resulting in the pipeline terminating due to an error somewhere in the files which I'm trying to trace down – user18959 Feb 22 '24 at 00:38
  • Got it working now. These gtf files have a lot of information that needs to be cleaned / corrected (e.g. missing gene IDs), but it appears to run now! – user18959 Mar 04 '24 at 05:55
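
For anyone hitting the same missing gene ID problem, a quick pre-flight check on the GTF can point to the offending lines before the pipeline is launched. The sketch below is an illustration only, not the cleaning that was actually done in this thread: the function name is hypothetical, and it assumes the usual GTF layout of nine tab-separated columns with attributes written as `key "value";`.

```python
import re

# Matches a gene_id attribute at the start of column 9 or after a ';'.
GENE_ID_RE = re.compile(r'(?:^|;)\s*gene_id\s+"([^"]*)"')


def find_gene_id_problems(gtf_lines):
    """Return (line_number, reason) for GTF rows lacking a usable gene_id."""
    problems = []
    for line_number, line in enumerate(gtf_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            problems.append((line_number, "expected 9 tab-separated columns"))
            continue
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            problems.append((line_number, "no gene_id attribute"))
        elif not match.group(1):
            problems.append((line_number, "empty gene_id"))
    return problems
```

Passing an open file handle (for example `open("annotation.gtf")`, a placeholder name) returns a list that can be inspected or counted. How to repair the reported lines depends on the annotation source and is not covered here.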