
Google searching for NM_002084 gives the following result:

NM_002084.4

This, I assume, is the latest version v4, hence the .4 suffix.

Searching for previous versions I get the following results, along with notes saying it was updated or removed.

NM_002084.3

This sequence has been updated. See current version.

NM_002084.2

NM_002084.1

Record removed. This record was replaced or removed.

Using biomaRt I can get the latest(?) version as follows:

library("biomaRt")

# define db
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# get refseqs
getBM(attributes = c('refseq_mrna',
                     'chromosome_name',
                     'transcript_start',
                     'transcript_end',
                     'strand'),
      filters = c('refseq_mrna'),
      values = list(refseq_mrna = "NM_002084"),
      mart = ensembl)
#  refseq_mrna chromosome_name transcript_start transcript_end strand
#1   NM_002084               5        151020438      151028992      1

But querying for specific versions gives nothing:

getBM(attributes = c('refseq_mrna',
                     'chromosome_name',
                     'transcript_start',
                     'transcript_end',
                     'strand'),
      filters = c('refseq_mrna'),
      values = list(refseq_mrna = c("NM_002084.1", "NM_002084.2", "NM_002084.3", "NM_002084.4")),
      mart = ensembl)
# [1] refseq_mrna      chromosome_name  transcript_start transcript_end   strand          
# <0 rows> (or 0-length row.names)
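
The `refseq_mrna` filter above only matched the unversioned accession, so versioned identifiers would need their suffix stripped before they can be used as filter values. A minimal base R sketch of that step follows; `stripVersion` is a hypothetical helper name, not part of biomaRt, and this only recovers the record Ensembl currently maps, not the older versions.

```r
# Hypothetical helper: drop a trailing ".<digits>" version suffix, if present.
stripVersion <- function(accession) {
  sub("\\.[0-9]+$", "", accession)
}

versioned <- c("NM_002084.1", "NM_002084.2", "NM_002084.3", "NM_002084.4")

# Collapse to the distinct unversioned accessions usable as filter values
unique(stripVersion(versioned))
# [1] "NM_002084"
```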

Question: How can I get all versions (preferably using R)?

bli
zx8754
  • Can you please explain why it's important to get all versions? The most recent version is typically an update of the sequence to correct errors in previous versions. – gringer Jun 14 '17 at 22:31
  • @gringer True, having the latest version should suffice, but I find it strange that it is not possible to get all versions automatically when the data is available online through a manual search. – zx8754 Jun 15 '17 at 06:48

1 Answer


I don't believe this is possible using biomaRt, nor using AnnotationHub.

I have two suggestions, neither of them very satisfactory. First, you can specify an Ensembl archive for biomaRt, for example:

mart72.hs <- useMart("ENSEMBL_MART_ENSEMBL", "hsapiens_gene_ensembl", 
                      host = "jun2013.archive.ensembl.org")

Of course, that requires that you have some idea of the date for each accession version and that the archives span that date - so not especially useful.

The other option is to access EUtils using e.g. rentrez, which does allow search by version number:

library(rentrez)
es <- entrez_search("nuccore", "NM_002084.1 NM_002084.2")
es$ids

[1] "4504104" "6006000"

So knowing the accession, you could simply append 1, 2, 3... to it, run the search and see if UIDs come back, then get them using entrez_fetch.

EDIT: here's a quick and dirty function which takes an accession as query, appends version = 1, fetches the ID, then increments the version and repeats until no more results are returned. It is not well-tested!

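getVersions <- function(accession) {
  # rentrez must be installed; fail early with an error if it is not
  stopifnot(requireNamespace("rentrez", quietly = TRUE))
  ids <- character()
  version <- 1
  repeat {
    # search nuccore for accession.version, e.g. "NM_002084.1"
    es <- rentrez::entrez_search("nuccore", paste0(accession, ".", version))
    # stop at the first version that returns no UID
    if (length(es$ids) == 0) {
      break
    }
    # keep only the first hit so each version maps to exactly one UID
    ids[version] <- es$ids[1]
    version <- version + 1
  }
  ids
}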

Example:

getVersions("NM_002084")
[1] "4504104"    "6006000"    "89903006"   "1048339180"
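
To go from those UIDs to the records themselves, something along these lines might work. It is a rough sketch: `fetchVersions` is a hypothetical helper name, the `"accessionversion"` summary field is an assumption about what the nuccore summaries contain, and whether removed or replaced versions are still retrievable has not been checked here.

```r
# Hypothetical helper: look up accession.version labels and FASTA text for UIDs.
fetchVersions <- function(uids) {
  stopifnot(requireNamespace("rentrez", quietly = TRUE))
  # Document summaries for each UID
  summaries <- rentrez::entrez_summary(db = "nuccore", id = uids)
  # "accessionversion" is an assumed field name in the nuccore summaries
  labels <- rentrez::extract_from_esummary(summaries, "accessionversion")
  # All requested records as a single FASTA-formatted string
  fasta <- rentrez::entrez_fetch(db = "nuccore", id = uids, rettype = "fasta")
  list(labels = labels, fasta = fasta)
}

# Usage (requires network access):
# res <- fetchVersions(getVersions("NM_002084"))
# res$labels
```

Returning the labels alongside the sequences makes it possible to tell which UID corresponds to which version before comparing them.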
neilfws
  • rentrez library looks like a good solution, could you please expand with code the last sentence? – zx8754 Jun 15 '17 at 06:42
  • Edited with some extra code. – neilfws Jun 15 '17 at 07:16
  • Are the IDs returned GI numbers? If so, they are not going to be very useful. – GenoMax Jun 15 '17 at 11:40
  • Well, that depends on what you want to do with the versions. If you want chromosome and coordinates as in the original question: that's always going to be difficult with old versions from old builds. If you want to retrieve the RefSeq records to compare versions, then UIDs are just fine. – neilfws Jun 15 '17 at 11:53