4

Grepping out short sequences from a fasta or fastq file is a really useful way to look at sequencing data. Using the option --color=always makes this even more useful, as you can visualize where the sequences appear in sequencing reads. For example:

zgrep 'GATCGATC' --color=always My_Hiseq_Data.fq.gz

grep color output

To make this more useful, how can I align the output of grep --color=always such that the beginning of the highlighted parts are aligned?

aligned output

terdon
  • 10,071
  • 5
  • 22
  • 48
conchoecia
  • 3,141
  • 2
  • 16
  • 40

2 Answers2

6

Perhaps, grep is not the best tool to use in this case, but it should be in principle possible by using grep & sed. Here is an example showing three symbols around a match.

zcat My_Hiseq_Data.fq.gz | \
      grep -Eo '.{0,3}GATCGATC.*' | \
      sed -En 's/.*/   \0/; s/.*(.{3}GATCGATC.{0,3}).*/\1/p' | \
      grep --color=always GATCGATC

Here are some explanations:

  1. zcat will just decompress a file.
  2. grep will output all the matches with 3-symbol context (or less). This is needed because a line can have multiple matches.
  3. first sed command will add 3 symbols before all the matches.
  4. second sed command will output the sequence with 3 surrounding characters on the left and max 3 characters on the right. This is an equivalent of grep -Eo '.{3}GATCGATC.{0,3}'.
  5. Finally, now that all matches are aligned, grep will color your sequence.

This code is possible to adapt to include more symbols of context.

I would recommend doing this in another language, though. E.g., in Python:

#!/usr/bin/env python3
import sys
import re

HL = '\033[93m'
ENDC = '\033[0m'


filename = sys.argv[1]
seq = sys.argv[2]
context = 10

seqre = re.compile(r'(.{0,%d})(%s)(.{0,%d})' % (context, sys.argv[2], context))

for line in open(filename):
    line = line.rstrip()
    for m in seqre.finditer(line):
        print(m.group(1).rjust(context),
              HL,
              m.group(2),
              ENDC,
              m.group(3),
              sep='')

UPD

A more advanced Python version inspired by this answer with support for multiple files, gzip, etc.

screenshot of match script

#!/usr/bin/env python3
import sys
import re
import argparse
import gzip
from termcolor import  colored


def align_matches(f, seqre, context):
    # for every line
    for line in f:
        # get rid of newline character
        line = line.rstrip()
        for m in seqre.finditer(line):
            # for every match print right-aligned left-context,
            # match colored in red, and right-context 
            print(m.group(1).rjust(context),
                  colored(m.group(2), 'red'),
                  m.group(3),
                  sep='')


# this function will check file extension to be able
# to open text files as well as gzip-compressed files
def myopen(fn):
    if fn.endswith('.gz'):
        return gzip.open(fn, 'rt')
    else:
        return open(fn, 'rt')



if __name__ == '__main__':
    # define script arguments
    parser = argparse.ArgumentParser(description='Aligned grep')
    parser.add_argument('pattern', type=str,
                        help='a valid regular expression')
    parser.add_argument('input', type=myopen, nargs='+',
                        help='files names, could be a gzip compressed file ending with .gz')
    parser.add_argument('--context', '-c', default=30, type=int,
                        help='number of context characters to show')
    args = parser.parse_args()

    # compile a regular expression with a match surrounded
    # by at most args.context number of characters
    seqre = re.compile(r'(.{0,%d})(%s)(.{0,%d})' % (
        args.context, args.pattern, args.context))

    # for every input file run align_matches
    for f in args.input:
        align_matches(f, seqre, args.context)
Iakov Davydov
  • 2,695
  • 1
  • 13
  • 34
2

I wrote a Perl script to do this. Just like zgrep, it can search through as many files as you give it. You can use the -c switch to tell it how much context (how many characters) you want around the match. So -c10 will search for your target string and 10 characters on either side of it. If you don't use the -c, it will default to 155, which should include the entire sequence for most cases.

To use it, save as alignString.pl somewhere in your $PATH (~/bin should work if you're using Ubuntu, for example), make it executable with chmod a+x ~/bin/alignString and then run it like this:

alignString.pl GAAAATCAG file1.fastq.gz file2.fastq.gz file3.fastq.gz

For example:

example output

The code:

#!/usr/bin/env perl

use warnings;
use strict;
use Term::ANSIColor;
use Getopt::Std;

my %opts;
getopts('c:',\%opts);
my $context = $opts{'c'} || 50;

## This subroutine does the alignment
sub align{
  ## Read the line and the target pattern
  my ($line, $pat) = @_;
  ## This will find 0 to $context characters on either side
  ## of the search pattern. Note that it will find the first
  ## occurrence of the pattern.
  $line =~ /(.{0,$context})$pat(.{0,$context})/;
  ## Save the strings to the left and right of the pattern
  my $left = $1;
  my $right = $2;
  ## If the left hand string is shorter than the value of $context,
  ## pad it with spaces so it can be aligned.
  while (length($left) < $context) {
    $left = " " . $left;
  }
  ## Return the padded left hand string, the pattenr colored red
  ## and the right hand string unchanged.
  return "$left" . colored($pat,"red") . "$right";
}

## The search pattern is the 1st argument
my $pattern = shift;
## Any remaining arguments are taken as target files to search
my @fastqFiles = @ARGV;

## Run the zgrep command
open(my $grepStream, '-|', "zgrep $pattern @fastqFiles");
## For each line returned by zgrep
while (my $line = <$grepStream>) {
  ## Align this line
  $line = align($line, $pattern);
  print "$line\n";
}
terdon
  • 10,071
  • 5
  • 22
  • 48
  • wonderful! I've resisted learning perl for a long time but it seems useful sometimes... – conchoecia Oct 27 '18 at 20:10
  • Beatiful! :) I didn't know about Term::ANSIColor, it is so simple to use! Your left padding while loop can probably be shortened to: $left = " " x ($context - length($left)) if length($left) < $context; – Peter Menzel Oct 28 '18 at 11:02
  • @PeterMenzel thanks :) and yes, various bits of this can be golfed but I was trying to keep it as clear to non Perl people as possible. – terdon Oct 28 '18 at 11:07
  • @terdon - over a year later I found an edge case where this doesn't work. If you search for a homopolymer like 'TTTTT', if there is a case where there are multiple matches (such as in a longer homopolymer 'TTTTTTTTTT'), this perl script will align to the rightmost match- not the leftmost – conchoecia Oct 23 '19 at 04:58