4

I'm beginning with the reference genome in FASTA format, hg19. I am reading the sequence into a Python dictionary with BioPython:

genome_dictionary = {} 
for seq_record.id in SeqIO.parse(input_fasta_file, "fasta"):
    genome_dictionary[seq_record.id] = seq_record

This creates a Python dictionary with keys as chromosome names and values as strings from the FASTA.

My goal is to take the FASTA, and simulate artificial SNPs and InDels along the FASTA to benchmark various algorithms. If I was only creating SNPs in the FASTA, this would be a trivial problem. The reference FASTA has the original coordinates, and the SNPs would also be in a coordinate system which is the same.

However, when I randomly insert an insertion (e.g. an insertion of 12 bp), I move all events into a new coordinate system such that

new coordinate system = old coordinate system + 15

This becomes very difficult to keep track off throughout the entire genome after several interations. I therefore cannot keep track of the new and old coordinate system without running an algorithm and creating a VCF, which may have errors. This defeats the purpose of the simulation.

The output would be either a BED file or VCF which has keep track of the changes.

What method/data structure is used to solve this problem? Would LiftOver work? Maybe CrossMap?

http://genome.sph.umich.edu/wiki/LiftOver http://crossmap.sourceforge.net/

terdon
  • 10,071
  • 5
  • 22
  • 48
ShanZhengYang
  • 1,691
  • 1
  • 14
  • 20
  • What are you using to create the changes? Why not just keep track of their location with it? – Devon Ryan Jun 20 '17 at 19:27
  • @DevonRyan These are just string manipulations, e.g. delete 10 characters is a 10 bp deletion. Keeping track isn't trivial----a delete of 10 bp means all bases must be moved over 10 from the original location. – ShanZhengYang Jun 20 '17 at 19:30
  • Very informally, I'd just suggest that many bioinformatics file formats just don't really handle changed coordinate systems very well. If you can somehow do the benchmarking without having to manage changed coordinates, you might be better off. – Colin D Jun 20 '17 at 19:35
  • "If you can somehow do the benchmarking without having to manage changed coordinates, you might be better off." This doesn't seem possible – ShanZhengYang Jun 20 '17 at 19:39
  • @ShanZhengYang: Err, that is trivial to track, you only care where the changes are in the original file, not where the positions are in the result. – Devon Ryan Jun 20 '17 at 19:40
  • @DevonRyan Perhaps you could explain? I'm randomly inserting SNPs, insertions and deletions in the FASTA. After +3 million events, I need to know how the new coordinates relate to the old coordinates, e.g. a VCF. This sounds quite algorithmically difficult to track to me. How would you do this? – ShanZhengYang Jun 20 '17 at 20:41
  • @ShanZhengYang: As you write the the output, track (1) where you are and (2) whether you insert/delete/mutate the base. As you make a modification, write it to a VCF or other file. You don't need to do anything other than that. – Devon Ryan Jun 20 '17 at 20:46
  • @DevonRyan This still doesn't make sense. So, randomly draw all of the positions for the indels/SNPs. Then sort them. Then implement them one by one, keeping track of the changes made? – ShanZhengYang Jun 20 '17 at 20:50
  • @DevonRyan "As you make a modification, write it to a VCF or other file." I think these modifications must be in order. Otherwise, it doesn't work. Correct? – ShanZhengYang Jun 20 '17 at 22:06
  • @ShanZhengYang: You can define at each base what, if any, kind of variant to add there. Yes, you MUST do this in order. Whether you predefine where the variants will be (modify that from the back then) or do it as you write out the sequence (modify if from the front then) is entirely up to you and makes no real difference. – Devon Ryan Jun 21 '17 at 06:32
  • "with keys as chromosome names and values as strings": I think that the code you show, the values are not simple strings, but SeqRecord objects. – bli Jun 21 '17 at 09:42
  • @bli You're correct---I should edit. – ShanZhengYang Jun 21 '17 at 10:00
  • @DevonRyan This is making sense now. Thank you for the help. A separate question is if I wanted to simulate a small structural variation between chromosomes, like a chromosomal translocation....in this more complex example, it becomes more difficult to keep track of changes. – ShanZhengYang Jun 21 '17 at 10:11
  • @ShanZhengYang: Yeah, don't go crazy with those. – Devon Ryan Jun 21 '17 at 10:21
  • @DevonRyan What do you do with overlapping intervals for events? Here, the order will matter. e.g. insertion and SNP vs. SNP and insertion – ShanZhengYang Jun 22 '17 at 15:49
  • Any one location will only ever have one (or zero) modifications. – Devon Ryan Jun 22 '17 at 16:03
  • @DevonRyan My question may have been unclear. You sort the events. When you do this, it's possible several events will occur at the same (or mre likely, overlapping) genomic coordinates. For example, an insertion from [5,15] and a SNP at 10. – ShanZhengYang Jun 23 '17 at 00:19
  • That would be more conveniently described as an insertion at base 5 (10 bases) and an SNP at base 10. Don't think in terms of InDel end positions, just start and widths. Obviously a deletion can't overlap a SNP, so keep track of that. – Devon Ryan Jun 23 '17 at 06:38
  • 1
    Note: I see in the documentation that there is a SeqIO.to_dict function that seems to do your data reading in one step – bli Jul 05 '17 at 10:11

2 Answers2

5

I've written a handful of programs from scratch to simulate mutations and variations in real or simulated sequences.

The trick has always been to sort the variants by genomic coordinate, apply the variant with the largest coordinate first, then apply the variant with the second largest coordinate, all the way down to the variant with the smallest coordinate.

Indels and other variants that affect sequence structure only affect subsequent coordinates. So going in reverse order ensures variants at the beginning of the sequence don't throw off variants at the end of the sequence.

Daniel Standage
  • 5,080
  • 15
  • 50
  • 1
    "going in reverse order ensures variants at the beginning of the sequence don't throw off variants at the end of the sequence" This is ingenious, thanks! Remaining question: what would be the most efficient way to create the VCF? @DevonRyan above has recommended the approach of writing each variant into a VCF when you manipulate the sequence. Is this your approach, or would you suggest something different? – ShanZhengYang Jun 21 '17 at 04:21
  • That's going to give you the quickest results, because there are existing tools available for creating modified sequences from VCF files. – gringer Jun 21 '17 at 08:06
  • Separate question: Have you ever tried the above with structural variations? Let's say I wanted to simulate a small structural variation between chromosomes, like a chromosomal translocation....in this more complex example, it becomes more difficult to keep track of changes. – ShanZhengYang Jun 21 '17 at 10:12
  • @ShanZhengYang Nope, I've only ever done SNVs, indels, and inversions. Nothing more complex. – Daniel Standage Jun 21 '17 at 17:08
4

If you look for a program which would randomly introduce SNPs + short indels and then would save everything into a VCF file, DWGsim or Mason Variator could be a good choice. Then you can create a corresponding Chain file using bcftools consensus -c and transform various formats between these two coordinate systems using CrossMap.

Karel Břinda
  • 1,909
  • 9
  • 19
  • How do these algorithms generate an accurate VCF after randomly introducing SNPs and InDels? – ShanZhengYang Jun 20 '17 at 20:12
  • The SNPs / InDels are generated directly by these programs so the programs know the coordinates, etc. – Karel Břinda Jun 20 '17 at 20:21
  • I don't understand how they keep track of these though. If you randomly insert an InDel, it changes the entire reference sequence. – ShanZhengYang Jun 20 '17 at 20:29
  • @ShanZhengYang Mason probably generates an initial read in the reference coordinate. If the initial read overlaps with an indel in VCF, it modifies the read sequence to include the indel. This way, the final read uses the reference coordinate and contains the indel. You don't need to think about another coordinate system at all. – user172818 Jun 20 '17 at 21:47