5

I have some RNA-seq data containing reads from the "host" as well as from several bacterial species. In this experimental context, I am interested in the host-associated reads, and the bacterial reads are regarded as "contaminants".

I am using BBSplit to split host- and bacteria-associated reads using my genome of interest as well as a "repo" containing thousands of bacterial genomes that are likely to be present in the host.

This repo has multiple genome assemblies for the same species, making the BBSplit step computationally expensive. The total size of the repo is approximately 25GB, more than 25X the size of the human genome! I am wondering what would be the best criterion for picking one assembly of a particular genome over another.

All in all, different genome assemblies of a given species have more or less similar total lengths, %GC and numbers of contigs, meaning that picking any of them would probably be fine; but not being a microbiologist, I would like to hear experts' opinions on this. My current take is to pick the assembly with the largest size for each species.
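That "largest assembly per species" rule is at least easy to script. A minimal sketch in plain Python (the species-to-files mapping would come from the repo's metadata; all names here are illustrative):

```python
import gzip

def assembly_length(path):
    """Total sequence length of an (optionally gzipped) FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    total = 0
    with opener(path, "rt") as fh:
        for line in fh:
            if not line.startswith(">"):
                total += len(line.strip())
    return total

def largest_per_species(assemblies):
    """assemblies: dict of species name -> list of FASTA paths.
    Returns, for each species, the path of its largest assembly."""
    return {species: max(paths, key=assembly_length)
            for species, paths in assemblies.items()}
```

Summing residue lines rather than parsing headers keeps it assembler-agnostic; for ~25GB of genomes it only streams each file once.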

haci
  • Have you tried testing a couple of randomly chosen ones? Do your results vary significantly? – terdon Feb 26 '24 at 13:20
  • This is rather difficult to test, as it would require comparing count matrices generated after removing reads associated with one bacterial genome assembly versus another. There are hundreds of species and each has multiple assemblies. The biggest problem is the computational overhead of the mapping with BBSplit. I believe the best way to proceed would be generating a pan-genome for each species and then mapping to these, but I don't have much experience working with pan-genomes. – haci Feb 26 '24 at 13:28
  • do you have the genome of your organism of interest/host? Or even a transcriptome of a related one? You may also cluster reads in your FASTQ files using clumpify from BBMap prior to any mapping/species assignment to speed up the subsequent steps. – darked89 Feb 26 '24 at 13:30
  • Fair enough. I was just thinking that if you tried the analysis with one genome and then with another and compared the results, it might give you a vague indication of whether it's worth playing with the many genomes or if, for your purposes, any of them will be good enough. – terdon Feb 26 '24 at 13:30
  • "My current take is to take the assembly with the largest size for each species." .... Well, yes, possibly for bacteria - it's complicated ..... We'd need more biological context and to know what the central biological problem is. Basically, what's the host? Is this an infection? – M__ Feb 26 '24 at 21:47
  • Thanks @darked89, I was not aware of clumpify, I will take a look. And my host organism is human. – haci Mar 01 '24 at 14:33
  • @haci imho you can just map all the reads to the human genome using a splice-aware mapper (e.g. STAR). The overwhelming majority of reads from non-eukaryotes will simply not map to a human genome. If you are not interested in them, there is no need to perform a costly filtering. – darked89 Mar 01 '24 at 15:20
  • Hi @darked89, while your suggestion would be OK for most genes, there might well be genes that are affected; at least I cannot claim otherwise. As a matter of fact, an initial quick-and-dirty analysis shows that my count matrices generated with or w/o filtering have a correlation coefficient of 0.93, much less than I anticipated. I will revive this thread once my colleagues compare the DEG lists to see the "real effect". – haci Mar 01 '24 at 18:18
  • @haci what is your read length? Also: what are the genes for which there is such cross-talk between human and bacteria? Last but not least: unmapped reads do not influence your human read counts => you may speed up filtering of the putative bacterial reads using just the reads which mapped to the human genome. Bonus point: this will let you check if the filtered-out reads really are bacterial or human. – darked89 Mar 01 '24 at 23:49
  • @darked89, the whole point of this exercise is that I don't have to "worry about" such a cross-talk, if any. I know that unmapped reads would not have an effect, my problem is the opposite, "the origin of reads that map to the human genome". I am already performing what you suggest and I can tell a simple matrix subtraction of count tables generated w or w/o filtering does not give a matrix full of zeros or sth similar as I was initially expecting, so there is some effect. And I have 50bp SE reads, that is another problem, I would have felt much safer with longer PE reads. – haci Mar 02 '24 at 07:25
  • I tend to agree with @darked89 here. Most reads that come from bacteria will simply not map when you map to the host. Moreover, for those reads that would actually map to both host and bacteria, you have no way of knowing where the respective RNA actually originates from. Removing those would assume that they are certainly not from the host. The fact that some reads that map to human also map to a huge collection of bacteria does not really prove that they are of bacterial origin. – Niklas Mar 06 '24 at 18:43

2 Answers

0

A couple of ideas:

  1. Use kraken2 to classify the reads. If you are able to convert your "repo" to kraken2 format, then the matching will be very fast and relatively accurate. Also, using bwa-mem on a joint database could work.
  2. I am not a big fan of custom and not-well-defined "repos". Why don't you just use the official RefSeq repository of prokaryotic genomes? (https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/). I don't remember its size, but I remember using it without any problem.
Fabio Marroni
  • Thank you for the suggestions. 1) I have used Kraken2 as well as Centrifuge for read classification. Apparently the latter is better for shorter reads. Anyway, both failed to classify more or less half of the reads, probably because of my read length of 50 bp. 2) What I am using is the human oral microbiome database; they fetch genomes from RefSeq. And then again, I don't want to map to everything but to a collection of "good representatives". – haci Mar 02 '24 at 19:06
  • Some more "random" ideas, in addition to the idea of choosing the largest assembly for each species:
    1. you can select for each species the Reference assembly, which is indicated in RefSeq.
    2. I wouldn't be against darked89's suggestion (map to human only), but I also understand that this should be done only in case of emergency.
    3. Have you considered mapping with STAR on the joint human-bacterial database?
    – Fabio Marroni Mar 03 '24 at 09:23
  • Thanks! Taking the reference for each species is definitely a good idea. And regarding your third point, can you please elaborate, where can I find this human-bacterial database? Good old Google failed me this time. For STAR, I would need a genome FASTA and associated annotation GTF file, and I believe such a combined genome would be GBs big. And regarding your 2nd point, how do you explain the fact that I end up with different count matrices with and w/o filtering for bacterial reads? – haci Mar 03 '24 at 14:31
  • Regarding the 3rd point: you merge the human and all the bacterial FASTAs into a multi-FASTA and then build the STAR index with that. You are right that it will be several gigabytes, but it is possible that once you create the index (IF creating the index gives no problems) STAR will be fast enough in mapping. Regarding the 2nd point: I think you are right in supposing that some gene portions may be not too dissimilar from bacterial transcripts, and thus it would be better to map to all the genomes at the same time. – Fabio Marroni Mar 03 '24 at 18:00
  • My mapping attempts with BBMap took days with the full set of genomes (hence the question in the first place), not to mention the many problems I encountered while creating the BBMap index. I don't think doing it with STAR will be much faster. Thanks anyway. – haci Mar 03 '24 at 18:07
-1
  1. What I would suggest, without knowing much biology about the system, is that it's a tree problem. The difficulty is knowing whether the repo is prokaryotic. To solve it I'd take the 16S gene and make a tree - each 16S sequence carries its genome ID. I'd then draw a tree; e.g. IQ-TREE is good and fast. This will describe the overall diversity and provide an informed basis for how to represent each species.

  2. Your approach certainly has merit because, again, if these are bacteria they go through gene loss - which can cause problems. Selecting the largest genome minimises this loss.

What I would do is combine 1 and 2: label the taxon with the largest genome and then produce the 16S tree. It will then be obvious by manual inspection whether your approach is representative of the genetic diversity. If it isn't, you can do the whole thing manually (which could be painful); the other approach is to use a tree-reading package/module/library. I use ETE3; you'd use ape (@haci, from memory, mostly works in R).

The issue is traversing the tree from the 'biggest genome' to the most distantly related member of the same species. What you do is identify the MRCA (most recent common ancestor) for that species and pick a taxon that is in the opposite clade to the biggest genome; any member of that clade will do. This automates the process via Python or R code, so thousands of genomes/species can be screened.

To try and be clearer, what you are doing in the code is moving up the tree one node at a time and checking whether all the members below it are the same species. As soon as the answer is 'no', it's the previous node that's the MRCA for the species. In a bifurcating tree, that divides the data into its two biggest groups of genetic diversity.

This approach ensures you get a reasonable representation of the species by selecting two members of it, one of which is the largest genome.

It's a while since I've used ETE3, so I can't remember the methods; there's probably an MRCA method for a given species. You can do this empirically, doing a recursive climb up the tree from the biggest genome until your monophyly contains a different species. The previous node is the MRCA; then you ask it to divide the species into two clades, and any member of the clade that does not contain the biggest genome is what you want.
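Since I can't vouch for the exact ETE3 method names, here is a library-free sketch of that recursive climb on a minimal tree structure (the `Node` class, function names and the toy tree are all illustrative; with ETE3 or ape you would use their own traversal methods instead):

```python
from dataclasses import dataclass, field

@dataclass(eq=False)           # identity-based equality: safe for `in` checks
class Node:
    name: str = ""             # assembly ID for leaves, e.g. "spA_asm1"
    species: str = ""          # species label (leaves only)
    children: list = field(default_factory=list)
    parent: "Node" = None

def clade(*children):
    """Join child nodes under a new internal node."""
    parent = Node(children=list(children))
    for c in children:
        c.parent = parent
    return parent

def leaves(node):
    """All leaf nodes below (and including) `node`."""
    if not node.children:
        return [node]
    return [l for c in node.children for l in leaves(c)]

def species_mrca(leaf):
    """Climb from `leaf` until the next node up would include another
    species; the last all-same-species node is the species' MRCA."""
    node = leaf
    while node.parent is not None and all(
            l.species == leaf.species for l in leaves(node.parent)):
        node = node.parent
    return node

def second_representative(biggest):
    """Pick any member of the MRCA's clade that does NOT contain the
    largest-genome assembly (the 'opposite clade' described above)."""
    mrca = species_mrca(biggest)
    for child in mrca.children:
        if biggest not in leaves(child):
            return leaves(child)[0]
    return None                # species has a single assembly in the tree
```

On a toy tree like `clade(clade(clade(a1, a2), a3), b1)`, climbing from `a1` stops at the clade holding all three species-A assemblies, and the opposite-clade pick is `a3`.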

I agree that it needs a bit of "tree-thinking" to do this ... but it's doable. You then use the genome IDs to filter your repo and produce a greatly slimmed down repo.
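The final filtering step, keeping only the selected genome IDs from the repo's multi-FASTA, can be sketched like this (plain Python; the ID is assumed to be the first whitespace-delimited token of each '>' header, and the paths in the usage comment are illustrative):

```python
def filter_fasta(lines, keep_ids):
    """Stream a multi-FASTA, yielding only the records whose ID
    (first token after '>') is in keep_ids."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            keep = line[1:].split()[0] in keep_ids
        if keep:
            yield line

# Typical use (paths are illustrative):
# with open("repo.fa") as src, open("slim_repo.fa", "w") as out:
#     out.writelines(filter_fasta(src, selected_ids))
```

Because it streams line by line, this works on a 25GB repo without loading anything into memory.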

You could alternatively switch to an HPC approach (e.g. AWS) and use the entire repo.

The alternative approach is to know a lot about the biology/pathogenesis of the problem.

M__
  • Well, the answer is correct, regardless of the number of downvotes. Explaining it clearly - particularly in code - is a lot of text. Basically, automating a representative genetic-diversity data set is not trivial unless there's an external biological rationale, but here none is specified. – M__ Feb 28 '24 at 22:09
  • Thanks @M_ for the detailed answer. The contaminants are bacteria and the host is human. I think I will first try your suggestion of using 16S trees to see if taking the largest genome was a good call. Since I have over 500 species, I somehow need to automate the representative/not-a-good-representative call. – haci Mar 01 '24 at 14:24
  • BTW, regarding the downvote on this answer, I don't really find it useful to downvote without explaining why. In this case, here is a complex answer in a domain in which I am not experienced (hence the question), and now I don't know which part might be "off". Giving a reason for the downvote would spark a discussion that the community would benefit from. – haci Mar 01 '24 at 14:30