
I found a good answer related to my question here: How to get a list of genes corresponding to the list of SNPs (rs ids)?

But that answer only covers a small number of SNPs. I want to get gene names for approximately 12,000 to 19,000 SNPs. NCBI allows about 10 requests per second with an API key. Is there an efficient way to get gene names for thousands of SNPs? Is the database available to download and store locally? Any advice would be appreciated.

bli

2 Answers


Adjust the SNP and gene annotation sources to match your genome assembly of interest. This example shows how one might do this for hg19:

  1. Download SNPs and convert to BED:

    $ wget -qO- ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/common_all_20180423.vcf.gz \
        | gunzip -c \
        | convert2bed --input=vcf --output=bed --sort-tmpdir=${PWD} - \
        > hg19.snp151.bed
    
  2. Filter this BED file for entries of interest:

    $ grep -wFf snps-of-interest.txt hg19.snp151.bed | cut -f1-6 > hg19.snp151.filtered.bed6
    
  3. Download gene annotations and convert to BED:

    $ wget -O - ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz \
        | gunzip -c \
        | awk '($3 == "gene")' \
        | gtf2bed \
        > gencode.v19.genes.bed
    

    Note: In the future, newer versions of Gencode annotations may be available for your reference assembly. Always check their site to confirm what is most recent.

  4. Map genes to SNPs:

    $ bedmap --echo --echo-map-id hg19.snp151.filtered.bed6 gencode.v19.genes.bed > snps_with_associated_gene_names.bed
    

The genes that overlap each SNP of interest follow the six BED columns in the output, after bedmap's default `|` delimiter, with multiple genes separated by `;`.
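
To get one SNP-gene pair per line, the bedmap output can be reshaped with awk. This is a minimal sketch, assuming bedmap's default delimiters (`|` between the echoed SNP line and the mapped IDs, `;` between IDs). `snp_gene_pairs` is a hypothetical helper name, and the ID it prints is whatever gtf2bed placed in the fourth column of the gene BED file.

```shell
# Hypothetical helper: turn bedmap lines of the form
#   chr<TAB>start<TAB>end<TAB>rsID<TAB>...|gene1;gene2
# into "rsID<TAB>gene" pairs, one per line. Reads files given as
# arguments, or standard input when none are given.
snp_gene_pairs() {
    awk -F'|' 'BEGIN { OFS = "\t" }
    {
        split($1, bed, "\t")        # bed[4] holds the rs ID
        n = split($2, genes, ";")   # overlapping gene IDs, if any
        if (n == 0) { print bed[4], "NA"; next }   # SNP outside every gene
        for (i = 1; i <= n; i++) print bed[4], genes[i]
    }' "$@"
}

# Usage (file name taken from step 4):
#   snp_gene_pairs snps_with_associated_gene_names.bed > snp_gene_pairs.tsv
```

SNPs that overlap no gene are kept with `NA` so that the output still accounts for every input SNP.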

If this must be done within Python, you could use something like subprocess.call(...), where ... includes commands wrapped in bash -c.

I don't recommend this because of the need to escape quote marks and deal with Python's quirks in running command-line tasks, which makes a Python-based approach very difficult to set up and maintain. I'd suggest learning some basic bash scripting to make this easy and fast.
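
As a sketch of the bash route, steps 2 and 4 can be wrapped in one small function. `annotate_snps` and its argument order are hypothetical. It assumes the two BED files from steps 1 and 3 already exist and are sorted, and it relies on bedmap reading the reference set from standard input when given `-`.

```shell
# Hypothetical wrapper around steps 2 and 4.
# Usage: annotate_snps IDS_FILE SNPS_BED GENES_BED
annotate_snps() {
    # Step 2: keep only the SNPs of interest, first six columns.
    # Step 4: pipe them straight into bedmap ("-" means standard input).
    grep -wFf "$1" "$2" \
        | cut -f1-6 \
        | bedmap --echo --echo-map-id - "$3"
}

# Example call with the file names used in the steps:
#   annotate_snps snps-of-interest.txt hg19.snp151.bed gencode.v19.genes.bed \
#       > snps_with_associated_gene_names.bed
```

Piping the filtered SNPs directly into bedmap avoids the intermediate `hg19.snp151.filtered.bed6` file. Keep that file if you want to inspect which SNPs matched.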

Alex Reynolds

Run them through the Ensembl VEP. You'll get the gene name and all the information about how the genes are affected by the variants.
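
For illustration, a VEP command-line run on a file of rs IDs might look like the sketch below. It assumes a local install of Ensembl VEP available as `vep` with access to the public Ensembl database, which rs ID input needs for the lookup. `snps_to_genes_vep` and the file names are hypothetical. For thousands of variants, the VEP documentation recommends a local cache over database-only mode for speed.

```shell
# Hypothetical wrapper: annotate a file of rs IDs (one per line) with VEP.
# Usage: snps_to_genes_vep IDS_FILE OUTPUT_FILE
snps_to_genes_vep() {
    # --format id : input lines are variant identifiers such as rs IDs
    # --database  : look variants up in the Ensembl database
    # --symbol    : add the gene symbol to the output
    # --tab       : write tab-delimited output, one column per field
    vep --input_file "$1" --format id \
        --database \
        --symbol --tab \
        --output_file "$2"
}

# Example call (placeholder file names):
#   snps_to_genes_vep snps-of-interest.txt snps_vep.tsv
```

The tab-delimited output should then carry a `SYMBOL` column alongside the consequence annotations for each variant.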

Emily_Ensembl