How to compare sequences of genes obtained after whole-genome sequencing to reference genome?

Question

My idea is to align whole-genome sequencing data (as fastq files after, 30× coverage, gDNA) to the reference mouse genome (NCBI), extract the immunoglobulin locus and compare it to the reference. I think the question is too broad but maybe a hint how to start ? (I'm new to the field.)

Task is: to check for difference between reference genome and a sample for 100-200 genes.

I checked these: question (but it's closed), answer (gave a hint about whole reference mapping not only the locus of interest), question (not answered), question (useful I guess but not sure).

Additional info: Instrument Platform: ILLUMINA
Instrument Model: HiSeq X Ten
Library Layout: PAIRED
Library Strategy: WGS
Library Source: GENOMIC

Typical sequence length distribution from FastQC: 150nt.

I have Win10 with WSL 1 installed.

Please [edit] your question and give us more details. What species are you working with? What technique? Is this DNAseq or RNAseq or something else? Do you know that the locus is or is that what you want to find out? — terdon, Oct 21 '21 at 11:53
@terdon: Thanks for the comment. I've edited. But I didn't understand your question "What technique?". What do you mean? — abc, Oct 21 '21 at 20:47
Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. — Community, Oct 23 '21 at 14:40
I've managed to get BAM and BAI files with https://usegalaxy.org. Now, I'm trying to visualize with IGV... — abc, Nov 07 '21 at 22:33

conchoecia · Answer 1 · 2021-11-21T00:50:50.803

2

Surprisingly there aren't actually many helpful one-liners to map Illumina WGS reads on the internet - everyone just sort of assumes that people can figure it out from the manual. Many people end up doing it in a not-so-ideal way, in that they stop at the .sam files that bwa mem makes that take up a lot of disk space.

I like this incantation:

# index the reference
bwa index reference_genome.fasta
map your reads to the reference (bwa mem)
compress that output to a bam file (samtools view -hb)
sort the bam file (samtools sort)
bwa mem -t <number_of_threads> reference_genome.fasta your_sample.R1.fastq.gz your_sample.R2.fastq.gz | 

samtools view -hb - | 

samtools sort - > your_reads_to_reference.sorted.bam
index the bamfile. makes a .bai file
samtools index your_reads_to_reference.sorted.bam

With this sorted .bam, the .bai, and the reference genome you can now use IGV to directly look at the reads mapped to your genome.

Also with the bam, you can generate a VCF file using freebayes or another tool like samtools mpileup. This will give you a file of how your sample is different from the reference genome. You can use bedtools intersect to get the variants only from your gene/regions of choice.

edited Nov 21 '21 at 00:50

answered Nov 07 '21 at 17:03

conchoecia

3,141
2
16
40

conchoecia, thank you for your answer and suggestions. I did Bowtie2 at https://usegalaxy.org. Liam McIntyre and you suggest bwa mem, minimap2, or ngmlr. My reads are ~150nt. What tool should I choose? I guess for 150nt bwa mem is better. Currently with .bam and .bai files I'm trying to visualize data in IGV, but yes, genes of interest are preferred to show. I hope to completely switch to command line – abc Nov 07 '21 at 22:46
1

bwa mem is the "standard" for Illumina WGS reads. minimap2 is the "standard" for long WGS reads, like PacBio or ONT. – conchoecia Nov 08 '21 at 06:27
Then I'll do bwa mem. Thank you – abc Nov 08 '21 at 06:53
according to timpo0's comment https://github.com/jts/methylation-analysis/issues/13#issuecomment-370628831 with reference to https://github.com/lh3/bwa ("For all the algorithms, BWA first needs to construct the FM-index for the reference genome (the index command)."): bwa index reference.fasta precede the rest lines – abc Nov 11 '21 at 22:27
.bam file's size is 0 bytes... – abc Nov 13 '21 at 07:21
1

@abc I also got 0 bytes also. here is a possible helper bwa index ref.fa; bwa mem ref.fa out1.fq out2.fq | samtools sort - -o out.sorted.bam. I am using samtools 1.14 and bwa-0.7.17-r1188 – Colin D Nov 16 '21 at 05:49
1

you can also try this command from the samtools docs "fastq to bam", http://www.htslib.org/workflow/#fastq_to_bam just swap minimap2 with bwa mem e.g. bwa mem ref.fa out1.fq out2.fq | samtools fixmate -u -m - - | samtools sort -u -@2 - | samtools markdup -@8 --reference ref.fa - final.cram – Colin D Nov 16 '21 at 05:56
@ColinD: thanks for suggestion, 9h run (1st code) resulted in 12 out.sorted.bam.tmp.00XX.bam files of ~260Mb, the run is still going. A file is generated ~ every 30mins. In shell I see [M::process] read 70190 sequences (10000248 bp)... [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (8, 26835, 0, 4) [] skip orientation FF as there are not enough pairs [] analyzing insert size distribution for orientation FR... [] (25, 50, 75) percentile: (172, 245, 331) [] low and high boundaries for computing mean and std.dev: (1, 649) [] mean and std.dev: (257.71, 118.14) <...> – abc Nov 17 '21 at 07:09
<...> [] low and high boundaries for proper pairs: (1, 808) [] skip orientation RF as there are not enough pairs [] skip orientation RR as there are not enough pairs [M::mem_process_seqs] Processed 70060 reads in 64.900 CPU sec, 64.655 real sec – abc Nov 17 '21 at 07:11
1

that all sounds expected :) – Colin D Nov 17 '21 at 11:52

score 1 · Answer 2 · answered Oct 21 '21 at 20:39

1

This is very easy. First map all reads to the reference sequence for the organism. If they are short reads use BWA mem. If long reads use minimap2 or ngmlr. Once mapped view the locus of interest in igv (GUI). If you need to extract the reads at the locus pass the coordinates to samtools view (cli).

answered Oct 21 '21 at 20:39

Liam McIntyre

579
4
11

MvIntyre: thank you for the answer, may be you can give advice any source on the topic as well ? I found BWA-MEM: https://github.com/lh3/bwa and https://www.ridom.de/u/BWA_Mapper.html, IGV: http://software.broadinstitute.org/software/igv/home, Samtools: http://www.htslib.org/ – abc Oct 22 '21 at 06:19
1

They are all git projects – Liam McIntyre Oct 23 '21 at 10:13

How to compare sequences of genes obtained after whole-genome sequencing to reference genome?

2 Answers2

map your reads to the reference (bwa mem)

compress that output to a bam file (samtools view -hb)

sort the bam file (samtools sort)

index the bamfile. makes a .bai file