How to efficiently get human gene names from NCBI based on a large list of SNPs

Question

I found a good answer related to my question here: How to get a list of genes corresponding to the list of SNPs (rs ids)?

But that answer deals with a small number of SNPs. I want to get gene names for approximately 12,000 to 19,000 SNPs. NCBI allows about 10 requests per second with an API key. Is there an efficient way to get gene names for thousands of SNPs? Is the database available to download and store locally? Any advice, please?

Yes, you can download the NCBI data locally; see this page. – llrs, Jul 31 '19 at 12:40

Alex Reynolds · Answer 1 · 2019-09-03T16:58:47.307

Get SNPs and gene annotations for your genome assembly of interest, adjusting the URLs below to match it. This example shows how one might do this for hg19:

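Download SNPs and convert to BED. The dbSNP VCF names chromosomes 1, 2, and so on, while the Gencode annotations use chr1, chr2, and so on, so a chr prefix is added here to make the two files comparable (the mitochondrial chromosome is named differently in the two sources and would need separate handling):

$ wget -qO- ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/common_all_20180423.vcf.gz \
    | gunzip -c \
    | convert2bed --input=vcf --output=bed --sort-tmpdir=${PWD} - \
    | awk '{ print "chr"$0 }' \
    > hg19.snp151.bed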

Filter this BED file for entries of interest:

$ grep -wFf snps-of-interest.txt hg19.snp151.bed | cut -f1-6 > hg19.snp151.filtered.bed6

Download gene annotations and convert to BED:
```
$ wget -O - ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz \
    | gunzip -c \
    | awk '($3 == "gene")' \
    | gtf2bed \
    > gencode.v19.genes.bed
```
Note: In the future, newer versions of Gencode annotations may be available for your reference assembly. Always check their site to confirm what is most recent.

Map genes to SNPs:

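$ bedmap --echo --echo-map-id --delim '\t' hg19.snp151.filtered.bed6 gencode.v19.genes.bed > snps_with_associated_gene_names.bed

With the tab delimiter, the genes that overlap each SNP of interest will be listed in the seventh column of the output, separated by semicolons when there is more than one. (Without --delim, bedmap separates the SNP columns from the mapped IDs with a | character instead of a tab.) As far as I know, gtf2bed fills the ID column from the GTF gene_id attribute, so these are Ensembl gene IDs; the matching gene_name values are in the attributes column of gencode.v19.genes.bed.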
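If the result needs to end up in Python afterwards, the output file can be read into a dictionary. This is only a sketch under some assumptions: the SNP ID is in the fourth column, the mapped IDs are the seventh field, the delimiter before them is either a tab or bedmap's default |, and multiple IDs are joined with semicolons. The function name and the GENE_A-style values are made up for illustration.

```python
import re

def genes_by_snp(lines):
    """Map each SNP ID (BED column 4) to the IDs bedmap reported for it."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on tabs and on bedmap's default '|' delimiter alike.
        fields = re.split(r"[\t|]", line)
        snp_id = fields[3]
        mapped = fields[6] if len(fields) > 6 else ""
        genes = mapping.setdefault(snp_id, [])
        for gene in mapped.split(";"):
            # Skip empty entries (SNPs with no overlap) and duplicates.
            if gene and gene not in genes:
                genes.append(gene)
    return mapping

# Hypothetical lines in the shape of snps_with_associated_gene_names.bed
example = [
    "chr1\t100\t101\trs0000001\t.\tA\tGENE_A;GENE_B",
    "chr2\t200\t201\trs0000002\t.\tC|GENE_C",
    "chr3\t300\t301\trs0000003\t.\tG\t",
]
print(genes_by_snp(example))
```

With a real file, passing the open file handle works the same way, for example genes_by_snp(open("snps_with_associated_gene_names.bed")).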

If this must be done within Python, you could use something like subprocess.call(...), where ... includes commands wrapped in bash -c.

I don't recommend this because of the need to escape quote marks and deal with Python's quirks in running command-line tasks, which makes a Python-based approach very difficult to set up and maintain. I'd suggest learning some basic bash scripting to make this easy and fast.
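For anyone who still has to drive this from Python, here is a minimal sketch of the bash -c approach. It uses subprocess.run instead of subprocess.call so that the output can be captured and a failed command raises an exception; run_pipeline is a hypothetical helper name, and bash is assumed to be on the PATH.

```python
import subprocess

def run_pipeline(command: str) -> str:
    """Run a shell pipeline through bash and return its standard output as text."""
    result = subprocess.run(
        # The whole pipeline is a single argument to bash -c, so Python does
        # no word-splitting of its own. pipefail makes a failure anywhere in
        # the pipeline raise CalledProcessError instead of passing silently.
        ["bash", "-c", "set -o pipefail; " + command],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout

# A harmless pipeline that shows the call shape.
print(run_pipeline("printf 'rs2\\nrs1\\n' | sort"))

# The filter step above could be run the same way (not executed here,
# because it needs the input files to exist):
# filtered = run_pipeline("grep -wFf snps-of-interest.txt hg19.snp151.bed | cut -f1-6")
```

Passing the pipeline as one list element avoids one layer of quoting, but any quotes inside the shell command itself still have to be escaped for Python, which is the maintenance cost mentioned above.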

I have a list of SNPs. Can your suggested technique be applied in Python? — studentcoder, Aug 01 '19 at 09:38
Maybe look into subprocess.call() to run command line tasks inside Python. — Alex Reynolds, Aug 01 '19 at 11:46

score 0 · Answer 2 · answered Sep 03 '19 at 07:56

Run them through the Ensembl Variant Effect Predictor (VEP). You'll get the gene name for each variant, along with information about how the genes are affected by the variants.

Emily_Ensembl
