Is there a dataset of English-language poems? I couldn't find one online. I could scrape poetry sites, but I want to avoid copyright issues, so even a dataset of older poems would do.

philshem
cjds

1 Answer

To make sure the poetry is in the public domain, you can use Project Gutenberg. In particular, it has a "Bookshelf" dedicated to poetry. With some exceptions, Project Gutenberg's texts are in English.

You can download all of Project Gutenberg through its Harvest site. See here for more details on downloading with wget.

Probably more useful is a wrapper library such as Python's Gutenberg. In this case, you would feed a list of e-book IDs (e.g. 12924) from the Bookshelf link to a Python script, which then downloads all the TXT data.

from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers

id_list = [12924, 16786]  # example e-book IDs; better to read them from a parameter file
for id_nr in id_list:  # fetch each e-book in turn
    text = strip_headers(load_etext(id_nr)).strip()  # drop the Project Gutenberg header and footer
    # actually do something with `text`
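
As a rough sketch of the "parameter file" idea mentioned in the comment above, the IDs could be read from a plain-text file and each e-book saved to disk. The file format (one ID per line, `#` for comments), the helper names `parse_id_lines`, `fetch_with_gutenberg` and `save_texts`, and the output layout are all assumptions made for illustration; only `load_etext` and `strip_headers` come from the snippet above.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def parse_id_lines(lines: Iterable[str]) -> List[int]:
    """Return e-book IDs from lines, skipping blanks and '#' comments."""
    ids = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()  # drop any trailing comment
        if entry:
            ids.append(int(entry))
    return ids


def fetch_with_gutenberg(ebook_id: int) -> str:
    """Download one e-book and strip the Project Gutenberg header and footer."""
    # Imported lazily so the other helpers work without the library installed.
    from gutenberg.acquire import load_etext
    from gutenberg.cleanup import strip_headers

    return strip_headers(load_etext(ebook_id)).strip()


def save_texts(
    ids: Iterable[int],
    out_dir: str,
    fetch: Callable[[int], str] = fetch_with_gutenberg,
) -> List[Path]:
    """Write each e-book to <out_dir>/<id>.txt and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for ebook_id in ids:
        path = out / f"{ebook_id}.txt"
        path.write_text(fetch(ebook_id), encoding="utf-8")
        paths.append(path)
    return paths
```

With a hypothetical file `poetry_ids.txt` holding the IDs collected from the Bookshelf, this could be called as `save_texts(parse_id_lines(Path("poetry_ids.txt").read_text().splitlines()), "poems")`. Passing `fetch` as a parameter is only a design convenience: it lets the file handling be tried out with a stand-in function before downloading anything.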

Other languages have wrappers, too.

philshem
  • Is Project Gutenberg part of the US open data initiative? I didn't know that. So the U.S. government has data on "cultural" matters? – Tom Au Jul 18 '16 at 20:41