5

I would like to extract all the CDS from a batch of genomes. I have found a perl script from BioStars but this does not seem to work for me. I would preferably like to have a script/ method which will use the locus tag as a header.

Example

My gff files are from prokka output

ATCC0000    Prodigal:2.6    CDS 243 434 .   +   0   ID=FKKLIMLP_00001;Parent=FKKLIMLP_00001_gene;inference=ab initio prediction:Prodigal:2.6;locus_tag=FKKLIMLP_00001;product=hypothetical protein
ATCC0000    prokka  gene    243 434 .   +   .   ID=FKKLIMLP_00001_gene;locus_tag=FKKLIMLP_00001
ATCC0000    Prodigal:2.6    CDS 1727    2131    .   -   0   ID=FKKLIMLP_00002;Parent=FKKLIMLP_00002_gene;inference=ab initio prediction:Prodigal:2.6;locus_tag=FKKLIMLP_00002;product=hypothetical protein
AudileF
  • 955
  • 8
  • 25

5 Answers5

6

I like bedtools getfasta. My typical option set is bedtools getfasta -fi <reference> -bed <gff_file> -name -s. Be aware of the -s to make sure you are pulling the correct strand. I like bedtools because it is a versatile tool overall for handling bed, gff and vcf file manipulations.

# bedtools getfasta

Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.26.0
Summary: Extract DNA sequences from a fasta file based on feature coordinates.

Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

Options: 
    -fi Input FASTA file
    -bed    BED/GFF/VCF file of ranges to extract from -fi
    -name   Use the name field for the FASTA header
    -split  given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
    -tab    Write output in TAB delimited format.
        - Default is FASTA format.

    -s  Force strandedness. If the feature occupies the antisense,
        strand, the sequence will be reverse complemented.
        - By default, strand information is ignored.

    -fullHeader Use full fasta header.
        - By default, only the word before the first space or tab is used.
Bioathlete
  • 2,574
  • 12
  • 29
4

The gffread utility in Cufflinks package might be interesting for you. To generate a multi-fasta file with nucleotide sequences from your GFF file, then you can try:

gffread -w output_transcripts.fasta -g reference_genome.fa input_transcripts.gff
aechchiki
  • 2,676
  • 11
  • 34
3

Using python and this GFF parser that mimics Biopython's SeqIO parsers:

from BCBio import GFF

# Read the gff
for seq in GFF.parse('my_file.gff'):
    # only focus on the CDSs
    for feat in filter(lambda x: x.type == 'CDS',
                       seq.features):
        # extract the locus tag
        locus_tag = feat.qualifiers.get('locus_tag',
                                        ['unspecified'])[0]
        # extract the sequence
        dna_seq = seq = str(feat.extract(seq).seq)
        # simply print the sequence in fasta format
        print('>%s\n%s' % (locus_tag, dna_seq))
mgalardini
  • 977
  • 7
  • 18
  • 1
    AFAIK GFF3 is a version of GFF, and the page that advertises the BCBio parser indicates that it is compatible in reading/writing GFF3: http://biopython.org/wiki/GFF_Parsing – mgalardini Aug 21 '17 at 11:16
2

The xtractore program from the AEGeAn Toolkit was designed for this type of use case. Just set --type=CDS.

$ xtractore -h

xtractore: extract sequences corresponding to annotated features from the
           given sequence file

Usage: xtractore [options] features.gff3 sequences.fasta
  Options:
    -d|--debug            print debugging output
    -h|--help             print this help message and exit
    -i|--idfile: FILE     file containing a list of feature IDs (1 per line
                          with no spaces); if provided, only features with
                          IDs in this file will be extracted
    -o|--outfile: FILE    file to which output sequences will be written;
                          default is terminal (stdout)
    -t|--type: STRING     feature type to extract; can be used multiple
                          times to extract features of multiple types
    -v|--version          print version number and exit
    -V|--verbose          print verbose warning and error messages
    -w|--width: INT       width of each line of sequence in the Fasta
                          output; default is 80; set to 0 for no
                          formatting

It looks like you're processing prokaryotic genomes, but this program works also on eukaryotic genomes where CDSs often have introns interspersed.

Daniel Standage
  • 5,080
  • 15
  • 50
0

You can use agat_sp_extract_sequences.pl from AGAT. See here for all option: https://agat.readthedocs.io/en/latest/tools/agat_sp_extract_sequences.html.

Below example of use:

enter image description here

juke34
  • 311
  • 3
  • 9