
I'm a new bioinformatics scientist working for a yeast genetics company.

Objective: to create a database of yeast genomes from NCBI and identify SNP variants.

My pipeline:

  1. FastQC
  2. Trimmomatic
  3. BWA
  4. GATK

The method is to check the quality of the downloaded reads, trim off adapters and low-quality bases, and align the reads to a reference prior to variant calling (producing a VCF).

The NCBI sequence data come from both Illumina and PacBio sequencing.

Questions

  1. Is the pipeline reasonable? For example, is BWA still a good choice compared with more recent aligners?
  2. Could the same pipeline be used for both Illumina and PacBio sequences?

Note: I don't plan on assembling any of the reads, as I'm only combing through them to find variants.

user438383
rimo
  • 2
    I attempted to edit the question with a view to it being more focused, albeit remains 'broad'. The OP can decide whether this is okay. In case the OP is unaware the preferred format is a focused, specific question to invite an equally focused answer. Further focusing the question to provide a single question would be great. The other components can be asked in separate question(s). – M__ Aug 31 '22 at 00:15
  • 1
    I'd suggest using minimap2 for long reads, BWA is not well suited to higher error reads. See the settings under its -x flag for the specific technology. – Maximilian Press Aug 31 '22 at 20:18
  • 2
    Thank you @M__ for cleaning up my question it reads a lot better now. – rimo Sep 01 '22 at 15:37
  • Would you recommend BWA for Illumina reads and then minimap2 for PacBio reads? @MaximilianPress – rimo Sep 01 '22 at 15:38
  • 1
    I'd recommend using minimap2 for both. I believe that Heng Li has written that minimap2 is at least as good as bwa for Illumina if you set the -x sr flag. – Maximilian Press Sep 01 '22 at 16:16
  • 2
    For pacbio variant calling, there are a few resources online: https://www.biostars.org/p/393823/, https://gatk.broadinstitute.org/hc/en-us/community/posts/360072716972-Variant-calling-with-PacBio-HiFi-reads – Maximilian Press Sep 01 '22 at 16:17
  • Hi @rimo, please look at the answer and Maximilian's response and consider upvoting and/or accepting. – M__ Sep 12 '23 at 12:18

1 Answer


For Illumina reads (only)

Prerequisite: download and prepare the .fna and .gff files of the yeast reference.

Mapping (in the 02 directory)

The first step is mapping the downloaded reads to the reference genome:

  1. Build an index for the reference using the following command:

bwa index ../01./GCF_000146045.2_R64_genomic.fna

This indexes the reference genome for rapid searching and alignment.

  2. Do the mapping with bwa mem (preferably from a bash shell script that loops over the samples, so there is no need to run each one individually).

  3. After mapping, use GATK4 for variant calling on the mapping results.
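The mapping loop from step 2 could be sketched roughly as follows. This is only an illustration under several assumptions: paired-end reads named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz` in a hypothetical `trimmed` directory, samtools being installed, and a bwa version that supports `-o`. The read group is set at alignment time here rather than afterwards. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
set -euo pipefail

REF="${REF:-../01./GCF_000146045.2_R64_genomic.fna}"  # reference indexed in step 1
READS_DIR="${READS_DIR:-trimmed}"   # hypothetical directory of trimmed reads
THREADS="${THREADS:-4}"
DRY_RUN="${DRY_RUN:-1}"             # 1 = print the commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

map_sample() {
    local sample="$1"
    # The read group is set here so that GATK can tell the samples apart.
    local rg="@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA"
    run bwa mem -t "${THREADS}" -R "${rg}" -o "${sample}.sam" "${REF}" \
        "${READS_DIR}/${sample}_1.fastq.gz" "${READS_DIR}/${sample}_2.fastq.gz"
    run samtools sort -o "${sample}.sorted.bam" "${sample}.sam"
    run samtools index "${sample}.sorted.bam"
}

# Loop over every forward-read file and derive the sample name from it.
for r1 in "${READS_DIR}"/*_1.fastq.gz; do
    [ -e "${r1}" ] || continue      # no matching files: nothing to do
    map_sample "$(basename "${r1}" _1.fastq.gz)"
done
```

Writing an intermediate SAM file keeps each step a plain command; piping bwa mem straight into samtools sort would save disk space at the cost of a slightly less readable script.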

The steps of variant calling with GATK4 are: processing the mapping results (mark duplicates and add read groups); raw variant identification (calling each individual sample, followed by joint calling); base quality recalibration and realignment; a second round of variant calling; and final variant filtering. Filtering removes uninformative variants; an obvious filter is to remove SNPs with identical calls in all the samples.
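A minimal sketch of the core GATK4 commands is given below, not a definitive workflow. It assumes coordinate-sorted BAMs with read groups already set, named `<sample>.sorted.bam` (a hypothetical naming scheme), and hypothetical sample names. The recalibration round is left out because it needs a set of known variant sites, and the `QD < 2.0` hard filter is only a placeholder threshold. Many yeast strains are haploid, so HaplotypeCaller's ploidy setting may also need adjusting. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
set -euo pipefail

REF="${REF:-../01./GCF_000146045.2_R64_genomic.fna}"
DRY_RUN="${DRY_RUN:-1}"             # 1 = print the commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

prepare_reference() {
    # GATK needs a .fai index and a sequence dictionary next to the FASTA.
    run samtools faidx "${REF}"
    run gatk CreateSequenceDictionary -R "${REF}"
}

call_sample() {
    local s="$1"
    run gatk MarkDuplicates -I "${s}.sorted.bam" -O "${s}.dedup.bam" -M "${s}.dup_metrics.txt"
    run samtools index "${s}.dedup.bam"
    # Per-sample calling in GVCF mode, so samples can be genotyped jointly later.
    run gatk HaplotypeCaller -R "${REF}" -I "${s}.dedup.bam" -O "${s}.g.vcf.gz" -ERC GVCF
}

joint_call() {
    local s
    local -a gvcfs=()
    for s in "$@"; do gvcfs+=(-V "${s}.g.vcf.gz"); done
    run gatk CombineGVCFs -R "${REF}" "${gvcfs[@]}" -O cohort.g.vcf.gz
    run gatk GenotypeGVCFs -R "${REF}" -V cohort.g.vcf.gz -O cohort.vcf.gz
    # Placeholder hard filter; the threshold is an assumption to tune per dataset.
    run gatk VariantFiltration -R "${REF}" -V cohort.vcf.gz -O cohort.filtered.vcf.gz \
        --filter-name "lowQD" --filter-expression "QD < 2.0"
}

SAMPLES=(sampleA sampleB)           # hypothetical sample names
prepare_reference
for s in "${SAMPLES[@]}"; do call_sample "$s"; done
joint_call "${SAMPLES[@]}"
```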

The final variant calling result contains both SNPs and InDels, and the final VCF file can be sent to SnpEff for annotation (an easier way is to use SnpEff in Galaxy with the right reference genome version, e.g. R64-1-1.75).
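For a local run instead of Galaxy, the annotation step might look like the sketch below. The jar location and the VCF file names are hypothetical, and it assumes the R64-1-1.75 database is available to your SnpEff installation. SnpEff matches chromosome names against its database, so a VCF called against the RefSeq .fna may need its contig names converted first. By default the script only prints the command; set `DRY_RUN=0` to execute it.

```shell
#!/usr/bin/env bash
set -euo pipefail

SNPEFF_JAR="${SNPEFF_JAR:-snpEff.jar}"   # hypothetical path to the SnpEff jar
SNPEFF_DB="${SNPEFF_DB:-R64-1-1.75}"     # genome version named in the answer
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print the command only

annotate_vcf() {
    local in_vcf="$1" out_vcf="$2"
    if [ "${DRY_RUN}" = "1" ]; then
        printf 'java -jar %s %s %s > %s\n' "${SNPEFF_JAR}" "${SNPEFF_DB}" "${in_vcf}" "${out_vcf}"
    else
        # SnpEff writes the annotated VCF to standard output.
        java -jar "${SNPEFF_JAR}" "${SNPEFF_DB}" "${in_vcf}" > "${out_vcf}"
    fi
}

annotate_vcf cohort.filtered.vcf cohort.ann.vcf
```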

After the SnpEff analysis, the report for the NCBI isolates should show 16 chromosomes and one mitochondrial chromosome. The analysis also gives the numbers of insertions and deletions.
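As a rough cross-check on the SnpEff numbers, SNPs, insertions and deletions can also be counted directly from the VCF. The helper below is a hypothetical sketch, not part of SnpEff or GATK. It classifies each ALT allele by comparing its length with REF, skips symbolic and spanning-deletion alleles, and ignores multi-base substitutions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Reads an uncompressed VCF on stdin and prints counts of SNPs,
# insertions and deletions, one count per ALT allele.
count_variant_types() {
    awk -F'\t' '
        /^#/ { next }                       # skip header lines
        {
            n = split($5, alts, ",")        # multi-allelic sites list several ALTs
            for (i = 1; i <= n; i++) {
                a = alts[i]
                if (a == "." || a == "*" || a ~ /^</) continue
                if (length($4) == 1 && length(a) == 1) snp++
                else if (length(a) > length($4))      ins++
                else if (length(a) < length($4))      del++
            }
        }
        END { printf "snps=%d insertions=%d deletions=%d\n", snp, ins, del }
    '
}

# Count from a file when one is given as the first argument.
if [ "$#" -ge 1 ]; then
    count_variant_types < "$1"
fi
```

For a compressed file, something like `zcat cohort.filtered.vcf.gz | count_variant_types` would work, where the file name is again only an example.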

envs_h_gang_5