2

Assume we have a query.fa file that contains sequences and we run:

blat -stepSize=5 -repMatch=2253 -minScore=20 -minIdentity=0 -out=pslx /genomes/mm10.fa.qz query.fa output.pslx

the output output.pslx file looks like this:

match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap   strand  Q               Q       Q       Q       T               T       T       T       block       blockSizes      qStarts  tStarts
        match   match           count   bases   count   bases           name            size    start   end     name            size    start   end     count
---------------------------------------------------------------------------------------------------------------------------------------------------------------
20      0       0       0       0       0       0       0       +       seq     20      0       20      chr9    124595110       44046930        44046950   20,      0,      44046930,       aaaagtatcagtgtgtatag,   aaaagtatcagtgtgtatag,
20      0       0       0       0       0       0       0       +       seq     20      0       20      chr9    124595110       44046930        44046950   20,      0,      44046930,       aaaagtatcagtgtgtatag,   aaaagtatcagtgtgtatag,

What would be a reasonable way to get the genomic contexts (5bp upsteam and 5bp downstream) for each aligned sequence.

For example, assume that blat found that the seq: AAATTGGGGAAAA aligns to chr2:100-113, so the question is how to get chr2:95-118 easily without reinventing the wheel.


I couldn't make it work with bedtools, because my genome's index file is corrupted, but this should work for others who have successfully used bwa or samtools to index their reference genome:

blat -stepSize=5 -repMatch=2253 -minScore=20 -minIdentity=0 -out=pslx /genomes/mm10.fa.qz query.fa output.pslx
awk 'NR>5 {print $14 "\t" $16-10"\t" $17+10}' output.pslx > regions.bed
bedtools getfasta -fi /genomes/mm10.fa.gz -bed regions.bed
0x90
  • 1,437
  • 9
  • 18
  • 1
    https://biopython.org/DIST/docs/api/Bio.SearchIO.BlatIO-module.html – 0x90 Sep 03 '20 at 17:55
  • 1
    Creat a bed file with chr2 95 118 inside, and then use it with bedtools getfasta to extract your region – user3479780 Sep 03 '20 at 23:00
  • @user3479780 I tried that. Add some issue with indexing my reference genome. But this should work too. You are right. – 0x90 Sep 03 '20 at 23:01

2 Answers2

1

Via BEDOPS convert2bed (psl2bed) and bedops operations:

$ psl2bed < hits.psl | bedops --range 5 --everything - > answer.bed

The file answer.bed will contain target intervals from the PSL (BLAT) input, padded up- and downstream by five bases.

This BED file can be run through samtools faidx or similar to get sequence data.

References:

Alex Reynolds
  • 3,135
  • 11
  • 27
0

I couldn't get it to work with varied: pcl2bed, samtools, bedtools commands. I think this biopython should work:

import gzip
from Bio import SeqIO
n=5

exact_matches = {} blat_qresult = SearchIO.read('blat_output.pslx', 'blat-psl', pslx=True) for hit in blat_qresult: for f in hit.fragments: if f.hit_span == f.query_span: chrom = hit.id if chrom not in exact_matches: exact_matches[chrom] = [] exact_matches[chrom].append(f.hit_range)

with gzip.open('/genomes/mm10.fa.gz', "rt") as handle: for record in SeqIO.parse(handle, "fasta"): if record.id in exact_matches: for subseq in exact_matches[record.id]: print(record.seq[subseq[0]-n:subseq[1]+n])

0x90
  • 1,437
  • 9
  • 18