I want to download large numbers of COX1 (COI) sequences from any available database. I found a script in this post (https://bioinformatics.stackexchange.com/a/13187/16657) that does exactly that using the bold R package, and I adapted it to my data:
# load packages
library(tidyverse)
library(rentrez)
library(bold) # API interface to BOLD
library(taxize) # for NCBI taxonomy lookup
library(seqinr) # for FASTA output
set_entrez_key()
# get order-level taxa within "Mollusca" from NCBI taxonomy
taxa <- downstream("Mollusca", db = "ncbi", downto = "order")
# check if taxa present in BOLD
checks <- bold_tax_name(taxa$Mollusca$childtaxa_name)
taxa_bold <- checks[!is.na(checks$taxon),]$taxon
# Download sequences from BOLD for each order-level taxon
sequences <- map(taxa_bold, bold_seq, marker = 'COI-5P') %>%
  flatten() %>%
  bind_rows()
# Convert the list of sequences to a data frame
sequences_df <- do.call(rbind, sequences)
# Write sequences to a file
write.fasta(
  sequences = as.list(sequences_df$sequence),
  names = as.list(sequences_df$id),
  nbchar = 80,
  file.out = 'coi5p.fasta'
)
According to the thread linked above, this script used to work, but when I ran it I got the following error:
Error in `bind_rows()`:
! Argument 1 must be a data frame or a named atomic vector.
Run `rlang::last_trace()` to see where the error occurred.
I am not sure how to fix this. I commented on the original thread, but since it is fairly old I decided to ask here as well.
Note: set_entrez_key() requires an NCBI API key, which you can get by registering with NCBI.
Edit #1: The sequences object before the flatten() command is a large list containing 53 elements. After the command, the object becomes a large list with 257 elements.
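To narrow this down, it may help to look at what the flattened list actually contains before binding. The following is only a sketch on toy data, not output from BOLD: the record fields (id, name, gene, sequence) are my assumption about what each record looks like, the malformed string element is made up for illustration, and is_record() is a hypothetical helper. It uses base R only, so it does not depend on how bind_rows() treats odd elements.

```r
# Toy stand-in for the result of map(taxa_bold, bold_seq, ...):
# one element per taxon, each holding a list of records.
per_taxon <- list(
  list(
    list(id = "A1", name = "sp1", gene = "COI-5P", sequence = "ACGT"),
    list(id = "A2", name = "sp2", gene = "COI-5P", sequence = "TTGA")
  ),
  list(
    list(id = "B1", name = "sp3", gene = "COI-5P", sequence = "GGCC"),
    "unexpected non-record element"  # hypothetical malformed entry
  )
)

# Remove exactly one level of nesting (what flatten() is used for above).
records <- unlist(per_taxon, recursive = FALSE)

# Hypothetical helper: a usable record is a list carrying the needed fields.
is_record <- function(x) {
  is.list(x) && all(c("id", "sequence") %in% names(x))
}
ok <- vapply(records, is_record, logical(1))

# Summarise element types to spot anything that is not a record.
type_counts <- table(vapply(records, function(x) class(x)[1], character(1)))
print(type_counts)

# Keep only well-formed records and bind them into one data frame.
rows <- lapply(records[ok], as.data.frame, stringsAsFactors = FALSE)
sequences_df <- do.call(rbind, rows)
print(sequences_df)
```

If the type summary on the real data shows elements that are not lists with those fields, those are the likely candidates for what bind_rows() rejects as "Argument 1".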
bind_rows command that you're concerned about, so it will probably be important for working out what's going wrong. – gringer Oct 29 '23 at 07:23