7

How do I obtain a bed file with the list of N-regions of GRCh38 reference? This is, the regions where the sequence is a stretch of Ns.

719016
  • 2,324
  • 13
  • 19

6 Answers6

5
# if you have seqtk installed, skip the following two lines
git clone https://github.com/lh3/seqtk
cd seqtk && make
# the main command line
./seqtk cutN -gp10000000 -n1 hg38.fa > hg38-N.bed

Option -n sets the min stretch length. Just use -p as is. It is a bit complicated to explain what it is doing.

user172818
  • 6,515
  • 2
  • 13
  • 29
4

This information is stored in the 2bit file representation of the sequence, so if you happen to have a 2bit file locally (or want to download one from UCSC) and have py2bit installed (you'll need version 3.0, since I literally just added support for this):

import py2bit
tb = py2bit.open("genome.2bit")
of = open("NNNNN.bed", "w")  # Do change the input and output file names
for chrom in tb.chroms().keys():
    blocks = tb.hardMaskedBlocks(chrom)
    for block in blocks:
        of.write("{}\t{}\t{}\n".format(chrom, block[0], block[1]))
of.close()
tb.close()

In case it's helpful, you can also use this to get all soft-masked regions. Just replace hardMaskedBlocks() with softMaskedBlocks() and ensure to specify storeMasked=True when you open the file.

Devon Ryan
  • 19,602
  • 2
  • 29
  • 60
3

Use twoBitInfo:

$ twoBitInfo file.fa -nBed output.bed

For example, to get all the N-masked regions on chromosome Y (also note you can use stdout as a filename to write directly to stdout, and use of the url as the input, no need to download the 2bit file):

$ twoBitInfo http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chrY -nBed stdout | head -3
chrY    0   10000
chrY    44821   94821
chrY    133871  222346

You can download twoBitInfo from the directory appropriate to your operating system here: http://hgdownload.soe.ucsc.edu/admin/exe/

2

I might as well jump on as well. Here's a Perl script that I wrote a while ago to split a fasta sequence at stretches of Ns:

https://github.com/gringer/bioinfscripts/blob/master/fasta-nsplit.pl

I've just modified it to spit out a BED formatted file showing the N locations on standard error. Use it as follows:

./fasta-nsplit.pl tig87_withN.fa 2>out.bed > out.split.fa

Output (BED file):

tig00000087 0   60
tig00000087 900 960
tig00000087 3840    3960
tig00000087 14880   14940
tig00000087 59520   59700
tig00000087 93000   93120
tig00000087 107880  107940
tig00000087 109135  109140

Here's the full script, for completeness / fixing / comment:

#!/usr/bin/perl
use warnings;
use strict;

my $seq = "";
my $shortSeqID = "";
my $seqID = "";
my $keep = 0;
my $cumLength = 0;
while(<>){
  chomp;
  if(/^>((.+?)( .*?\s*)?)$/){
    my $newID = $1;
    my $newShortID = $2;
    if($seq){
      my $inc = 0;
      while($seq =~ s/(NNNN+)(.*)//){
    my $nStretch = $1;
        my $newSeq = $2;
        printf(">%s.%s\n%s\n", $seqID, $inc++, $seq) if ($seq);
    $cumLength += length($seq);
    printf(STDERR "%s\t%d\t%d\n", $shortSeqID, $cumLength, 
           $cumLength + length($nStretch));
    $cumLength += length($nStretch);
        $seq = $newSeq;
      }
      printf(">%s\n%s\n", $seqID, $seq) if ($seq);
    }
    $seq = "";
    $shortSeqID = $newShortID;
    $seqID = $newID;
    $cumLength = 0;
  } else {
    $seq .= $_;
  }
}
if($seq){
  my $inc = 0;
  while($seq =~ s/(NNNN+)(.*)//){
    my $nStretch = $1;
    my $newSeq = $2;
    printf(">%s.%s\n%s\n", $seqID, $inc++, $seq) if ($seq);
    $cumLength += length($seq);
    printf(STDERR "%s\t%d\t%d\n", $shortSeqID, $cumLength, 
       $cumLength + length($nStretch));
    $cumLength += length($nStretch);
    $seq = $newSeq;
  }
  printf(">%s\n%s\n", $seqID, $seq) if ($seq);
}
gringer
  • 14,012
  • 5
  • 23
  • 79
1
  • @user172818 code can't count N at fa begining
  • @terdon code get wrong bed position
  • i change @terdon code like below; and it works for me
awk '{
    if (substr($1,1,1)==">")
        if (NR>1)
            printf "\n%s,", substr($0,2,length($0)-1)
        else
            printf "%s,", substr($0,2,length($0)-1)
    else
        printf "%s", $0
}
END{printf "\n"}' test.fa | \
perl -F"," -ane 'while($F[1] =~ /N+/ig){
                        for(0..$#-){                    
                            print "$F[0]\t$-[$_]\t$+[$_]\n";
                        }
                  }'
atongsa
  • 131
  • 4
1

Here's a way to generate it yourself from the genome sequence. First, convert the fasta file of the genome to tbl format (<seq id>\t<sequence>), then use perl to find the start and end positions of all stretches of consecutive of N or n.

FastaToTbl hg38.fa | 
    perl -F"\t" -ane 'while(/N+/ig){
                            for(0..$#-){
                                print "$F[0]\t$-[$_]\t$+[$_]\n"
                            }
                      }' > hg38.n.bed

Explanation

  • FastaToTbl : this is a very simple script that converts fasta to tbl. Just save the lines below somewhere in your $PATH (e.g. ~/bin) as FastaToTbl and make it executable.

    #!/usr/bin/awk -f
    {
        if (substr($1,1,1)==">")
                if (NR>1)
                    printf "\n%s\t", substr($0,2,length($0)-1)
                else 
                    printf "%s\t", substr($0,2,length($0)-1)
            else 
                printf "%s", $0
    }
    END{printf "\n"}
    
  • The Perl magic.

    • The -a makes perl behave like awk, splitting each input line on the value given by -F (a tab in this case) and saving the result in the aray @F. So, $F[0] will be the sequence id.
    • while(/N+/ig){ : this will match all cases of consecutive N or n (the i flag makes it case-insensitive). Perl will store the start positions of all matches in the special array @- and the end positions in @+.
    • for(0..$#-){ : iterate over all numbers from 0 to the final index ($#-) in the array @-.
    • print "$F[0]\t$-[$_]\t$+[$_]\n" : print minimal bed format data. The name of the current sequence ($F[0]), the start position of the current match ($-[$_]) and the corresponding end position ($+[$_]).

I just ran the above on my system and it generated this.

terdon
  • 10,071
  • 5
  • 22
  • 48