
I have 3224 Ensembl IDs as row names in a data frame "G". To convert the Ensembl IDs into gene symbols, I used biomaRt as follows.

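library(biomaRt)
# Connect to the current Ensembl human gene dataset
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
# The Ensembl gene IDs are stored as row names
genes <- rownames(G)
G <- G[, -6]
# Map each Ensembl ID to its HGNC symbol
G_list <- getBM(filters = "ensembl_gene_id", attributes = c("ensembl_gene_id", "hgnc_symbol"), values = genes, mart = mart)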

G_list now contains only 3200 Ensembl IDs, some with gene symbols and some without. Why are the other 24 Ensembl IDs missing from G_list? If those 24 IDs have no gene symbol, they should at least show "-".

Examples of problematic IDs are ENSG00000257061, ENSG00000255778 and ENSG00000267268. These do not appear in G_list (biomaRt) at all. I then submitted them to bioDBnet, which seems to handle them.

What is the problem here?

Devon Ryan
stack_learner
2 Answers


It looks like you were using an old annotation. The problematic IDs you posted existed in the GRCh37 annotations, but don't exist in the most recent GRCh38 annotation, so they were excluded from the result. The IDs that have - as symbols are present in the database but have no associated symbol.
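To see which IDs were dropped and still keep one row per input ID, something like the following base R sketch may help. The toy `genes` and `G_list` objects stand in for the ones in the question, and the ID `ENSG_NO_SYMBOL` is a hypothetical placeholder for an entry without a symbol. The code treats both `NA` and empty strings as "no symbol", since which of the two you see may depend on the setup.

```r
# Toy stand-ins for the objects in the question (illustrative values only).
genes <- c("ENSG00000141510", "ENSG00000257061", "ENSG_NO_SYMBOL")
G_list <- data.frame(
  ensembl_gene_id = c("ENSG00000141510", "ENSG_NO_SYMBOL"),
  hgnc_symbol     = c("TP53", ""),
  stringsAsFactors = FALSE
)

# Input IDs that are absent from the returned table.
missing_ids <- setdiff(genes, G_list$ensembl_gene_id)

# Left join: every input ID keeps a row, unmatched IDs get NA as symbol.
full <- merge(
  data.frame(ensembl_gene_id = genes, stringsAsFactors = FALSE),
  G_list,
  by = "ensembl_gene_id",
  all.x = TRUE
)

# Mark entries without a symbol (NA or empty string) with "-".
no_symbol <- is.na(full$hgnc_symbol) | full$hgnc_symbol == ""
full$hgnc_symbol[no_symbol] <- "-"
```

With the real data, `missing_ids` would list the IDs that are not in the queried release, which can then be looked up on the Ensembl website.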

To use an archived version in biomaRt:

mart = useDataset("hsapiens_gene_ensembl", useEnsembl(biomart="ensembl", version=84))

That's an example for release 84.
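If the release that produced the gene list is unknown, one option is to count how many IDs are absent from each of a few candidate releases. The sketch below is only an illustration: `query_release` and `count_missing` are hypothetical helper names, the query needs biomaRt and a network connection, and not every release number is necessarily served by the Ensembl archive (`biomaRt::listEnsemblArchives()` shows what is available).

```r
# Query one Ensembl release for the given IDs (needs biomaRt and network access).
query_release <- function(ids, version) {
  mart <- biomaRt::useEnsembl(
    biomart = "ensembl",
    dataset = "hsapiens_gene_ensembl",
    version = version
  )
  biomaRt::getBM(
    filters = "ensembl_gene_id",
    attributes = c("ensembl_gene_id", "hgnc_symbol"),
    values = ids,
    mart = mart
  )
}

# Count how many input IDs are absent from each candidate release.
# query_fun can be swapped out so the counting logic can be tried offline.
count_missing <- function(ids, versions, query_fun = query_release) {
  counts <- vapply(versions, function(v) {
    res <- query_fun(ids, v)
    length(setdiff(ids, res$ensembl_gene_id))
  }, integer(1))
  stats::setNames(counts, versions)
}

# Example call (not run here because it contacts Ensembl):
# count_missing(rownames(G), c(78, 83, 84))
```

A release for which the count is zero would be a candidate for the annotation the gene list originally came from.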

Devon Ryan
  • biomaRt will return NA for genes that are in the database, but don't have symbols. – Ian Sudbery Aug 25 '17 at 12:11
  • @IanSudbery Yeah, I'm assuming that OP replaced NA with - at some point. – Devon Ryan Aug 25 '17 at 12:13
  • Like you, I thought the problematic IDs existed in GRCh37, so I tried that too. Of the 3224 IDs, I found only 3089 in the results. – stack_learner Aug 25 '17 at 12:15
  • Those IDs existed in the GRCh38 release 78 annotation (just checked a local GTF), so perhaps they're still GRCh38, but just an older release of it. – Devon Ryan Aug 25 '17 at 12:19
  • I actually did it in this way. install.packages("devtools") devtools::install_github("stephenturner/annotables") library(dplyr) library(annotables) grch38 – stack_learner Aug 25 '17 at 12:30
  • So ENSG00000257061 was last seen in ENSEMBL 84 (you can look this up by searching on the ENSEMBL website). You need to search against the exact same version as you got your gene list from. Replace useMart with useEnsembl(version=X) where X is the version of ensembl you used to generate the gene list. – Ian Sudbery Aug 25 '17 at 12:30
  • So, can either of you please tell me how to use the latest version in the above code? Or is there an R package for bioDBnet? – stack_learner Aug 25 '17 at 12:53
  • @user3351523 I've updated my answer. The biodbnet question should be a separate post. – Devon Ryan Aug 25 '17 at 13:04
  • Just by-the-by, I'm going to guess that any ID that has been retired is unlikely to map to a HGNC symbol. – Ian Sudbery Aug 25 '17 at 13:07
  • @DevonRyan With version 84 as mentioned above, I now get 3220. 4 IDs are still missing. – stack_learner Aug 25 '17 at 15:06
  • Try version 83 then. – Devon Ryan Aug 25 '17 at 16:29
  • @user3351523, you have to use the same ensembl version as was used for generating your genes object, or you'll never get 100% conversion. – Ian Sudbery Aug 25 '17 at 16:37
  • I used "gProfileR". It had no problematic IDs. It converted the Ensembl IDs to gene symbols and also made my GO analysis easier. There is an R package and also an API. Thank you all for the help! – stack_learner Aug 29 '17 at 08:12

I used gProfileR, which had no problematic IDs. It converted the Ensembl IDs to gene symbols and also made my GO analysis easier. There is an R package and also an API.
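For illustration, a conversion along these lines might look like the sketch below. It assumes the gprofiler2 package (the successor to gProfileR) is installed and that a network connection is available. The wrapper name `symbols_via_gprofiler` is hypothetical, and this is not necessarily the exact call used for the result described above.

```r
# Hypothetical wrapper around gprofiler2::gconvert() (needs network access).
symbols_via_gprofiler <- function(ids) {
  res <- gprofiler2::gconvert(
    query = ids,
    organism = "hsapiens",
    target = "ENSG",
    filter_na = FALSE  # keep rows for IDs that could not be converted
  )
  # "input" holds the submitted ID, "name" the gene name reported by g:Profiler.
  res[, c("input", "name")]
}

# Example call (not run here because it contacts the g:Profiler service):
# symbols_via_gprofiler(c("ENSG00000257061", "ENSG00000255778", "ENSG00000267268"))
```

Whether retired IDs are resolved depends on the Ensembl version behind the g:Profiler service at the time of the query, so it is worth checking the returned rows rather than assuming full coverage.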

llrs
stack_learner