How to get the genomic sequences from a blat result?

Question

Assume we have a query.fa file that contains sequences and we run:

blat -stepSize=5 -repMatch=2253 -minScore=20 -minIdentity=0 -out=pslx /genomes/mm10.fa.qz query.fa output.pslx

the output output.pslx file looks like this:

match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap   strand  Q               Q       Q       Q       T               T       T       T       block       blockSizes      qStarts  tStarts
        match   match           count   bases   count   bases           name            size    start   end     name            size    start   end     count
---------------------------------------------------------------------------------------------------------------------------------------------------------------
20      0       0       0       0       0       0       0       +       seq     20      0       20      chr9    124595110       44046930        44046950   20,      0,      44046930,       aaaagtatcagtgtgtatag,   aaaagtatcagtgtgtatag,
20      0       0       0       0       0       0       0       +       seq     20      0       20      chr9    124595110       44046930        44046950   20,      0,      44046930,       aaaagtatcagtgtgtatag,   aaaagtatcagtgtgtatag,

What would be a reasonable way to get the genomic contexts (5bp upsteam and 5bp downstream) for each aligned sequence.

For example, assume that blat found that the seq: AAATTGGGGAAAA aligns to chr2:100-113, so the question is how to get chr2:95-118 easily without reinventing the wheel.

I couldn't make it work with bedtools, because my genome's index file is corrupted, but this should work for others who have successfully used bwa or samtools to index their reference genome:

blat -stepSize=5 -repMatch=2253 -minScore=20 -minIdentity=0 -out=pslx /genomes/mm10.fa.qz query.fa output.pslx
awk 'NR>5 {print $14 "\t" $16-10"\t" $17+10}' output.pslx > regions.bed
bedtools getfasta -fi /genomes/mm10.fa.gz -bed regions.bed

https://biopython.org/DIST/docs/api/Bio.SearchIO.BlatIO-module.html — 0x90, Sep 03 '20 at 17:55
Creat a bed file with chr2 95 118 inside, and then use it with bedtools getfasta to extract your region — user3479780, Sep 03 '20 at 23:00
@user3479780 I tried that. Add some issue with indexing my reference genome. But this should work too. You are right. — 0x90, Sep 03 '20 at 23:01

Alex Reynolds · Answer 1 · 2020-09-23T23:11:17.050

Via BEDOPS convert2bed (psl2bed) and bedops operations:

$ psl2bed < hits.psl | bedops --range 5 --everything - > answer.bed

The file answer.bed will contain target intervals from the PSL (BLAT) input, padded up- and downstream by five bases.

This BED file can be run through samtools faidx or similar to get sequence data.

References:

score 0 · Answer 2 · answered Sep 03 '20 at 22:28

I couldn't get it to work with varied: pcl2bed, samtools, bedtools commands. I think this biopython should work:

import gzip
from Bio import SeqIO
n=5
exact_matches = {}
blat_qresult = SearchIO.read('blat_output.pslx', 'blat-psl', pslx=True)
for hit in blat_qresult:
    for f in hit.fragments:
        if f.hit_span == f.query_span:
            chrom = hit.id
            if chrom not in exact_matches:
                exact_matches[chrom] = []
            exact_matches[chrom].append(f.hit_range)
with gzip.open('/genomes/mm10.fa.gz', "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        if record.id in exact_matches:
            for subseq in exact_matches[record.id]:
                print(record.seq[subseq[0]-n:subseq[1]+n])

How to get the genomic sequences from a blat result?

2 Answers2