I wanted to download large numbers of COX1 (or COI) sequences from any available database. I found a script in this post (https://bioinformatics.stackexchange.com/a/13187/16657) that achieves just that using the bold R package. I changed the script to fit my data preferences:

# load packages
library(tidyverse)
library(rentrez)
library(bold)    # API interface to BOLD
library(taxize)  # for NCBI taxonomy lookup
library(seqinr)  # for FASTA output

set_entrez_key()

# get order-level taxa within "Mollusca" from NCBI taxonomy
taxa <- downstream("Mollusca", db = "ncbi", downto = "order")

# check if taxa present in BOLD
checks <- bold_tax_name(taxa$Mollusca$childtaxa_name)
taxa_bold <- checks[!is.na(checks$taxon),]$taxon

# Download sequences from BOLD for each order-level taxon
sequences <- map(taxa_bold, bold_seq, marker = 'COI-5P') %>%
  flatten() %>%
  bind_rows()

# Convert the list of sequences to a data frame
sequences_df <- do.call(rbind, sequences)

# Write sequences to a file
write.fasta(
  sequences = as.list(sequences_df$sequence),
  names = as.list(sequences_df$id),
  nbchar = 80,
  file.out = 'coi5p.fasta'
)

The script worked in the aforementioned thread, but when I ran it now I got the following error:

Error in `bind_rows()`:
! Argument 1 must be a data frame or a named atomic vector.
Run `rlang::last_trace()` to see where the error occurred.

I am not sure how to fix this. I commented on the original thread, but since it is fairly old I decided to post about it here as well.

Note: set_entrez_key() requires an NCBI API key, which you can get by registering on NCBI.
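A hedged sketch of one way to supply the key without pasting it into the script: read it from an environment variable. The variable name NCBI_API_KEY is an assumption made for illustration (rentrez does not define it); set_entrez_key() is the rentrez function used above.

```r
# Read the NCBI API key from an environment variable instead of hard-coding it.
# NCBI_API_KEY is a hypothetical variable name chosen for this sketch.
ncbi_key <- Sys.getenv("NCBI_API_KEY", unset = "")

if (nzchar(ncbi_key) && requireNamespace("rentrez", quietly = TRUE)) {
  # Register the key with rentrez for the current R session.
  rentrez::set_entrez_key(ncbi_key)
  key_status <- "key set"
} else {
  key_status <- "no key set"
  message("No key registered: NCBI_API_KEY is empty or rentrez is not installed.")
}
```

The variable could be defined in ~/.Renviron, for example, so that the key never appears in a shared script.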

Edit #1: The sequences object before the flatten() command is a large list containing 53 elements. After the command, the object becomes a large list with 257 elements.
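To make the change in length concrete, here is a small base-R sketch on mock data (the IDs and sequences are invented; nothing is downloaded from BOLD). It assumes each taxon yields a list of records, and it uses unlist(recursive = FALSE) as a base-R stand-in for removing one level of nesting, which is what flatten() does.

```r
# Mock data: 2 taxa, each holding a list of sequence records.
per_taxon <- list(
  list(
    list(id = "REC001", sequence = "ACGTAC"),
    list(id = "REC002", sequence = "GGCATT")
  ),
  list(
    list(id = "REC003", sequence = "TTAGCA")
  )
)

length(per_taxon)   # 2: one element per taxon

# Remove one level of nesting, leaving one element per record.
records <- unlist(per_taxon, recursive = FALSE)
length(records)     # 3: one element per record

# Compact view of the structure, handy when describing the object to others.
str(records, max.level = 1)
```

The same str() call on the real object would show whether every element is a record of the expected shape.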

Nickmofoe
  • Can you please show the object as it appears before and after the 'flatten' command? It's unusual for me to see that command in R code, and this is just before the bind_rows command that you're concerned about, so it will probably be important for working out what's going wrong. – gringer Oct 29 '23 at 07:23
  • @gringer I edited the question with the details about the object. – Nickmofoe Oct 29 '23 at 07:55
  • Please show the head of the object, or the first few lines of output. Your text description is not detailed enough for me to understand what's going on. – gringer Oct 29 '23 at 11:30

1 Answer

I managed to find the solution. After @gringer mentioned that they had not seen the flatten() command before, I decided to dig deeper. It turns out the command was superseded in purrr 1.0.0. I am posting the new script for anyone looking for an updated version of the one in the old thread.
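For reference, a minimal sketch of the purrr 1.0.0 replacements on mock data (invented IDs and sequences, no BOLD download). It assumes purrr >= 1.0.0 is installed and that each taxon yields a list of records. list_flatten() and list_rbind() are the purrr functions that take over from flatten() and from the flatten-then-bind pattern. This is illustrative only and is not the script posted below.

```r
library(purrr)  # needs purrr >= 1.0.0 for list_flatten() and list_rbind()

# Mock data: 2 taxa, each holding a list of sequence records.
per_taxon <- list(
  list(
    list(id = "REC001", sequence = "ACGTAC"),
    list(id = "REC002", sequence = "GGCATT")
  ),
  list(
    list(id = "REC003", sequence = "TTAGCA")
  )
)

# list_flatten() removes exactly one level of nesting (one element per record).
records <- list_flatten(per_taxon)

# list_rbind() expects data frames, so convert each record to a one-row
# data frame before binding the rows together.
sequences_df <- list_rbind(map(records, as.data.frame))

sequences_df
```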

# load packages
library(tidyverse)
library(rentrez)
library(bold)    # API interface to BOLD
library(taxize)  # for NCBI taxonomy lookup
library(seqinr)  # for FASTA output
library(rBLAST)

set_entrez_key("YOUR_NCBI_API_KEY") # replace with your own NCBI API key; this is essential for pulling the taxonomy through NCBI

# get class-level taxa within "Mollusca" from NCBI taxonomy
# downto sets the taxonomic level of the search: it will pull sequences for all
# Mollusca, but separately for each class. If the search is large, change to a
# lower taxonomic level.
taxa <- downstream("Mollusca", db = "ncbi", downto = "class")

# check if taxa present in BOLD
checks <- bold_tax_name(taxa$Mollusca$childtaxa_name)
taxa_bold <- checks[!is.na(checks$taxon),]$taxon

# Download sequences from BOLD for each class-level taxon
sequences <- map(taxa_bold, bold_seq, marker = 'COI-5P')

# Convert the list of sequences to a data frame
sequences_df <- do.call(rbind, sequences)

# Filter sequences based on the "COI-5P" marker. Despite the marker parameter,
# some files may contain multiple loci. The other sequences are removed from
# the R object.
coi5p_sequences <- sequences_df %>% filter(marker == "COI-5P")

# Write sequences to a file
write.fasta(
  sequences = as.list(coi5p_sequences$sequence),
  names = as.list(coi5p_sequences$id),
  nbchar = 60,
  file.out = 'coi5p_Mollusca.fasta'
)
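As a way to sanity-check the output format without touching BOLD, here is a base-R sketch that writes mock records as FASTA with wrapped lines. It is a stand-in for illustration, not seqinr's write.fasta(). The helper names wrap_sequence() and write_fasta_sketch() and the mock data are hypothetical.

```r
# Split one sequence string into chunks of at most `width` characters.
wrap_sequence <- function(sequence, width = 60) {
  if (nchar(sequence) == 0) return(character(0))
  starts <- seq(1, nchar(sequence), by = width)
  substring(sequence, starts, starts + width - 1)
}

# Write a header line (">id") followed by the wrapped sequence for each record.
write_fasta_sketch <- function(ids, sequences, file, width = 60) {
  lines <- unlist(Map(
    function(id, s) c(paste0(">", id), wrap_sequence(s, width)),
    ids, sequences
  ), use.names = FALSE)
  writeLines(lines, file)
}

# Mock records (invented IDs and sequences).
mock <- data.frame(
  id = c("REC001", "REC002"),
  sequence = c("ACGTACGTAC", "GGCA"),
  stringsAsFactors = FALSE
)

out_file <- tempfile(fileext = ".fasta")
write_fasta_sketch(mock$id, mock$sequence, out_file, width = 4)
readLines(out_file)
```

Reading the file back with readLines() makes it easy to confirm that each header is followed by its own sequence lines before running the real download.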

Nickmofoe