
I'm working on a project to analyse how journal articles are cited. I have a large file of journal article titles, and I intend to query Google Scholar for each one to see how many citations it has.

Here is the strategy I am following:

  1. Use "scholar.py" from http://www.icir.org/christian/scholar.html. This is a pre-written Python script that searches Google Scholar and returns information on the first hit in CSV format (including the number of citations).

  2. Google Scholar blocks you after a certain number of searches (I have roughly 3000 article titles to query). I have found that most people use Tor (see "How to make urllib2 requests through Tor in Python?" and "Prevent Custom Web Crawler from being blocked") to work around this. Tor is a service that routes your traffic so that requests appear to come from a different IP address, which changes every few minutes.

I have both scholar.py and Tor set up and working. I'm not very familiar with Python or the urllib2 library, and I wonder what modifications scholar.py needs so that its queries are routed through Tor.

I am also open to suggestions for an easier (and potentially quite different) approach to making bulk Google Scholar queries, if one exists.

Thanks in advance

krishnan

2 Answers


For me the best way to use Tor is to set up a local HTTP proxy such as Polipo in front of it. I like to clone the repo and compile it locally:

git clone https://github.com/jech/polipo.git
cd polipo
make all
make install

But you can also use your package manager (brew install polipo on macOS, apt install polipo on Ubuntu). Then write a simple config file:

echo socksParentProxy=localhost:9050 > ~/.polipo
echo diskCacheRoot='""' >> ~/.polipo
echo disableLocalInterface=true >> ~/.polipo
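
Those three lines just build the following ~/.polipo file (the single quotes only protect the empty "" from the shell), so you can also write it by hand:

socksParentProxy=localhost:9050
diskCacheRoot=""
disableLocalInterface=true

Port 9050 is Tor's default SOCKS port, so Polipo ends up acting as an HTTP front end to Tor.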

Then run it:

polipo
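
Before touching any Python, you can sanity-check the chain with curl against Polipo's default port 8123 (just a quick verification, assuming the default ports):

curl -s -x http://localhost:8123 https://check.torproject.org/ | grep -o Congratulations

If Congratulations comes back, traffic sent through the proxy is leaving via Tor.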

See the urllib2 docs for how to use a proxy. Like many Unix applications, urllib2 will honor the http_proxy environment variable:

export http_proxy="http://localhost:8123"
export https_proxy="http://localhost:8123"
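
If you prefer to configure the proxy in code, for example inside scholar.py itself, rather than through environment variables, a minimal urllib2 sketch would look roughly like this (assuming Polipo on its default port 8123):

import urllib2

# Send all urllib2 traffic through the local Polipo proxy, which forwards to Tor.
proxy = urllib2.ProxyHandler({
    'http': 'http://localhost:8123',
    'https': 'http://localhost:8123',
})
urllib2.install_opener(urllib2.build_opener(proxy))  # affects every later urllib2.urlopen()

print(urllib2.urlopen('http://check.torproject.org/').read()[:200])

Since scholar.py does its fetching with urllib2 (as the question suggests), installing an opener like this near the top of the script should be enough to route its queries through Tor.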

I like to use the requests library, which is a much nicer HTTP client than urllib. If you don't have it already:

pip install requests

If everything is wired up correctly, the following one-liner (using requests) should print True:

python -c "import requests; print('Congratulations' in requests.get('http://check.torproject.org/').text)"
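
Equivalently, you can hand the proxy to requests explicitly instead of relying on the environment variables; a small sketch with the same Polipo port:

import requests

# Point requests at the local Polipo proxy explicitly instead of via http_proxy/https_proxy.
proxies = {
    "http": "http://localhost:8123",
    "https": "http://localhost:8123",
}
response = requests.get("http://check.torproject.org/", proxies=proxies)
print("Congratulations" in response.text)  # True when the request went out over Tor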

One last thing, beware: the Tor network is not a free pass for doing silly things on the Internet; even when using it, you should not assume you are totally anonymous.

Paulo Scardine

The most effective way is to use a CAPTCHA-solving service together with residential proxies; this is fast and reliable.

If you don't want to figure out CAPTCHAs or which proxies to use, you can try the Google Scholar API from SerpApi, a paid API with a free plan that bypasses blocks on its backend.

Code and an example in the online IDE that scrapes publications from all available pages, with the ability to save the results to CSV:

import pandas as pd
import os, json
from serpapi import GoogleScholarSearch
from urllib.parse import urlsplit, parse_qsl


def serpapi_scrape_all_publications(query: str):
    params = {
        "api_key": os.getenv("API_KEY"),    # your SerpApi API key
        "engine": "google_scholar",         # search engine
        "hl": "en",                         # language
        "q": query,                         # search query
        "num": "100"                        # articles per page
    }

    # where data extraction happens on SerpApi backend.
    search = GoogleScholarSearch(params)

    publications = []

    publications_is_present = True
    while publications_is_present:
        results = search.get_dict()         # JSON -> Python dictionary

        for publication in results.get("organic_results", []):
            publications.append({
                "title": publication.get("title"),
                "link": publication.get("link"),
                "result_id": publication.get("result_id"),
                "snippet": publication.get("snippet"),
                "inline_links": publication.get("inline_links"),
                "publication_info": publication.get("publication_info")
            })

        # checks for the next page and updates if present
        if "next" in results.get("serpapi_pagination", []):
            # split URL in parts as a dict() and update "search" variable to a new page
            search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
        else:
            publications_is_present = False

    print(json.dumps(publications, indent=2, ensure_ascii=False))

    return publications


publications = serpapi_scrape_all_publications(query="biology")

# save the collected results to CSV as mentioned above (example filename)
pd.DataFrame(publications).to_csv("google_scholar_publications.csv", index=False)

Outputs:

[
  {
    "title": "Fungal decomposition of wood: its biology and ecology",
    "link": null,
    "result_id": "LiWKgtH72owJ",
    "snippet": "",
    "inline_links": {
      "serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=LiWKgtH72owJ",
      "cited_by": {
        "total": 1446,
        "link": "https://scholar.google.com/scholar?cites=10149701587489662254&as_sdt=400005&sciodt=0,14&hl=en&num=20",
        "cites_id": "10149701587489662254",
        "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=400005&cites=10149701587489662254&engine=google_scholar&hl=en&num=20"
      },
      "related_pages_link": "https://scholar.google.com/scholar?q=related:LiWKgtH72owJ:scholar.google.com/&scioq=biology&hl=en&num=20&as_sdt=0,14",
      "versions": {
        "total": 6,
        "link": "https://scholar.google.com/scholar?cluster=10149701587489662254&hl=en&num=20&as_sdt=0,14",
        "cluster_id": "10149701587489662254",
        "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=0%2C14&cluster=10149701587489662254&engine=google_scholar&hl=en&num=20"
      }
    },
    "publication_info": {
      "summary": "ADM Rayner, L Boddy - 1988"
    }
  }, ... other results
]

Disclaimer, I work for SerpApi.

Dmitriy Zub