6

I am trying to retrieve the URLs of all articles listed inside Wikipedia list articles. For concreteness, let's consider the Wikipedia list article List_of_American_scientists.

I know how to use the MediaWiki API, and the following query gives a decent result:

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=jsonfm&titles=List_of_American_scientists

But the result it returns is the whole list as a single string. I would have to parse this string and convert the names to Wikipedia URLs myself. Is there a better way to retrieve the results, one that directly gives me the URLs of the pages?

Pushpendre

2 Answers

3

You can use the links API property to get the titles of all the links on the page: https://en.wikipedia.org/w/api.php?action=query&format=jsonfm&titles=List_of_American_scientists&prop=links&pllimit=500. Note that each request returns at most 500 links, so to get them all you need to repeat the query, passing along the plcontinue value from the previous response.
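The plcontinue loop described above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the function names (`fetch_json`, `title_to_url`, `iter_link_urls`) and the User-Agent string are made up for illustration, it uses `format=json` instead of `jsonfm` so the response is machine-readable, and it assumes article URLs follow the usual `https://en.wikipedia.org/wiki/<Title_with_underscores>` pattern.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the API and decode the JSON response."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent; replace it with one that identifies your tool.
    req = urllib.request.Request(url, headers={"User-Agent": "list-links-example/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def title_to_url(title):
    """Build an article URL from a title (assumed URL pattern, see lead-in)."""
    return "https://en.wikipedia.org/wiki/" + urllib.parse.quote(title.replace(" ", "_"))


def iter_link_urls(page_title, fetch=fetch_json):
    """Yield a URL for every link on the page, following plcontinue."""
    params = {
        "action": "query",
        "format": "json",
        "titles": page_title,
        "prop": "links",
        "pllimit": "500",
    }
    while True:
        data = fetch(params)
        # With format=json, "pages" is a dict keyed by page id.
        for page in data["query"]["pages"].values():
            for link in page.get("links", []):
                yield title_to_url(link["title"])
        cont = data.get("continue")
        if not cont:
            break
        # Merge the continuation values (including plcontinue) into the next request.
        params = {**params, **cont}


if __name__ == "__main__":
    for url in iter_link_urls("List_of_American_scientists"):
        print(url)
```

The links property returns every link on the page, not only the list entries, so the output will probably need filtering. Adding `plnamespace=0` to the parameters restricts the results to the article namespace, which removes links to templates, categories and the like.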

3

You might also use the SPARQL REST API to query Wikidata. For example, this query gets a table of US citizens whose occupation is a direct subclass of scientist:

PREFIX schema: <http://schema.org/>

SELECT ?item ?itemLabel ?article WHERE {
  ?item wdt:P31 wd:Q5.
  ?item wdt:P27 wd:Q30.
  ?item wdt:P106 ?occupation.
  ?occupation wdt:P279 wd:Q901.
  ?article schema:about ?item.
  ?article schema:inLanguage "en".
  ?article schema:isPartOf <https://en.wikipedia.org/>.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?item ?itemLabel ?article
ORDER BY ?itemLabel
LIMIT 1000
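To run the query above from a script, something along these lines might work. It is only a sketch under a few assumptions: the endpoint is taken to be the public Wikidata Query Service at `https://query.wikidata.org/sparql`, the response is read in the standard SPARQL JSON results format, and the names `run_sparql` and `article_urls` as well as the User-Agent string are invented for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the public Wikidata Query Service.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
PREFIX schema: <http://schema.org/>

SELECT ?item ?itemLabel ?article WHERE {
  ?item wdt:P31 wd:Q5.
  ?item wdt:P27 wd:Q30.
  ?item wdt:P106 ?occupation.
  ?occupation wdt:P279 wd:Q901.
  ?article schema:about ?item.
  ?article schema:inLanguage "en".
  ?article schema:isPartOf <https://en.wikipedia.org/>.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?item ?itemLabel ?article
ORDER BY ?itemLabel
LIMIT 1000
"""


def run_sparql(query):
    """Send the query to the endpoint and return the decoded JSON result."""
    url = SPARQL_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    headers = {
        # Hypothetical User-Agent; replace it with one that identifies your tool.
        "User-Agent": "sparql-example/0.1",
        "Accept": "application/sparql-results+json",
    }
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def article_urls(result):
    """Extract (label, article URL) pairs from a SPARQL JSON result."""
    pairs = []
    for row in result["results"]["bindings"]:
        label = row.get("itemLabel", {}).get("value", "")
        pairs.append((label, row["article"]["value"]))
    return pairs


if __name__ == "__main__":
    for label, url in article_urls(run_sparql(QUERY)):
        print(label, url)
```

Here the `?article` column already holds the full English Wikipedia URL, so no title-to-URL conversion is needed. Keep in mind that the result reflects what Wikidata records, which will not necessarily match the entries on the list page.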
aventurin