
The following is valid content for a .fasta file:

>HSBGPG Human gene for bone gla protein (BGP)
GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCCCTCCTCATCGCTGGGCACAGCCCAGAGGGT
ATAAACAGTGCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGAGCAGCAGCCCAGCGCAGCCACCGAGACACC
ATGAG.

Is this also valid?

>Arbirary_Name_iJustCameUP_with_and_other_local_identifiers
GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCCCTCCTCATCGCTGGGCACAGCCCAGAGGGT
ATAAACAGTGCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGAGCAGCAGCCCAGCGCAGCCACCGAGACACC
ATGAG.

rtviii

2 Answers


I am not aware of an official fasta format description. The only constraint that I know of is that there should be no whitespace immediately after >. Apart from that particular position, whitespace is allowed in the header line.
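As a rough illustration of that constraint, here is a minimal sketch in plain Python. The function name `split_header` is hypothetical, and the rule it enforces (an identifier must follow `>` directly, with any later text treated as a description) is only the common convention, not an official standard.

```python
def split_header(line):
    """Split a FASTA header line into (identifier, description).

    Assumes the common convention: the identifier runs from '>' up to
    the first whitespace; the rest of the line is free-text description.
    """
    if not line.startswith(">"):
        raise ValueError("header line must start with '>'")
    body = line[1:].rstrip("\r\n")
    if not body or body[0].isspace():
        raise ValueError("no identifier directly after '>'")
    parts = body.split(None, 1)  # split once, on the first whitespace run
    identifier = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


print(split_header(">HSBGPG Human gene for bone gla protein (BGP)"))
print(split_header(">Arbirary_Name_iJustCameUP_with_and_other_local_identifiers"))
```

A header such as `> foo` is rejected by this sketch; a more permissive tool could just as reasonably accept it.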

So, regarding your particular example, both are valid fasta sequences. When in doubt, you can use SeqIO from Biopython: if you can parse your file with the following code, it should be a valid fasta file.

from Bio import SeqIO

with open("example.fasta") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        print(record.id)

Edit per @Chris_Rands' comment

The code below does the same as above, meaning that SeqIO.parse() takes care of opening and closing the file.

from Bio import SeqIO

for record in SeqIO.parse("example.fasta", "fasta"):
    print(record.id)
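If Biopython is not installed, the same kind of sanity check can be approximated in plain Python. The sketch below is not Biopython's implementation: `read_fasta` is a hypothetical helper that assumes headers start with `>` and that sequence data may wrap over several lines. The sequences in the example are shortened for illustration.

```python
import io


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of text lines.

    Raises ValueError if sequence data appears before the first header.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data before first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Shortened stand-in for a real file; an open file handle works the same way.
example = io.StringIO(
    ">HSBGPG Human gene for bone gla protein (BGP)\n"
    "GGCAGATTCCCC\n"
    "CTAGACCC\n"
    ">Arbirary_Name\n"
    "ATGAG\n"
)
for header, seq in read_fasta(example):
    print(header.split()[0], len(seq))
```

This only checks the overall record structure. It does not validate the sequence alphabet, so it is looser than what some tools accept.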

haci
  • For such a simple format, there are actually quite a few odd edge cases with FASTA (e.g. see https://bioinformatics.stackexchange.com/questions/388/are-there-any-databases-of-templates-for-common-bioinformatic-file-formats). There's nothing official about Biopython's definition of FASTA (and it's subject to change, e.g. https://github.com/biopython/biopython/issues/1814) but I agree it's a practical solution – Chris_Rands Jul 09 '21 at 17:17
  • Just a note: anything following a space or tab on a > line is considered a comment. That text is not part of the sequence name. – user172818 Jul 10 '21 at 14:09
  • @user172818 why are you saying that? Yes, some tools parse sequence names that way but not all, that's just a relatively common convention, but as far as I know there is no fasta standard. So what is considered a comment is down to the author of whatever tool you're using to parse the file. Is there some standard I don't know about? – terdon Jul 12 '21 at 18:04
  • @terdon, here is the link about the "no space after >" constraint: http://genetics.bwh.harvard.edu/pph/FASTA.html . To be honest, I have not tested whether common fasta parsers work with a space or not. – haci Jul 12 '21 at 18:20
  • @haci yeah, that's just the format that PolyPhen requires, not a general characteristic of fasta which, since it isn't defined by any standard, doesn't really have any limitations at all. Here's another reputable source giving their own definition: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp. I am racking my brain but I can't remember where I used to see sequences like > foo (with a space). Either it was many years ago or I imagined it, or both. I'm still convinced I remember it but the brain does play tricks so I won't press the point. – terdon Jul 12 '21 at 18:25
  • @terdon This convention has been widely accepted by almost every mainstream tool for decades. Actually, section 2.2 in the fasta user guide gives an example of fasta: >sequence_name1 and description. I don't know if this example was in the earliest version of fasta, but anyway the original authors have accepted this convention, too. You can download sequences in fasta from NCBI or Ensembl and you will see they clearly put all sorts of auxiliary information after a space. Fastq has a paper which follows the same convention. – user172818 Jul 12 '21 at 21:30
  • @user172818 when you say "fasta user guide", do you mean the original fasta tool? If that already had that convention, then that's a pretty strong argument indeed. As for being an accepted convention, absolutely, no question about that. I just know that I've seen all sorts of stuff in various fasta headers in the wild so I'm wary of saying that anything is standardized in the format. – terdon Jul 12 '21 at 21:38
  • @terdon Yes, I was referring to this PDF from the fasta software website (section 2.2): https://fasta.bioch.virginia.edu/wrp_fasta/fasta_guide.pdf – user172818 Jul 12 '21 at 21:50

NCBI's FASTA format description:

https://www.ncbi.nlm.nih.gov/genbank/fastaformat/

NCBI's BLAST page describing valid FASTA input:

https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp

NCBI's SNP page describing FASTA format:

https://www.ncbi.nlm.nih.gov/projects/SNP/snp_legend.cgi?legend=fasta

UniProt's FASTA-header description:

https://www.uniprot.org/help/fasta-headers

Wikipedia reference (indicating format origin):

https://en.wikipedia.org/wiki/FASTA_format

FASTA program (origin of format, per Wikipedia):

https://fasta.bioch.virginia.edu/wrp_fasta/fasta_guide.pdf

Harvard PolyPhen page describing FASTA format:

http://genetics.bwh.harvard.edu/pph/FASTA.html

FASTA file format (from Pacific Biosciences):

https://pacbiofileformats.readthedocs.io/en/3.0/FASTA.html

jubilatious1