
I am looking for a large (>1000 documents) text corpus to download, preferably containing world news or reports of some kind. So far I have only found one with patents. Any suggestions?

the
  • This thread appears to be off topic. See http://meta.stats.stackexchange.com/questions/1032/data-sourcing-we-need-to-make-up-our-mind/1033#comment2001_1033. – whuber Jun 07 '12 at 21:39
  • This question appears to be off-topic because it is about finding a data set, rather than doing statistical analysis – Peter Flom Nov 07 '13 at 13:12
    Well that's awkward, because this Q&A is really useful. – Sideshow Bob Jan 07 '14 at 15:35
  • @guaka, please do not bump such old posts for such minor edits, especially a post that is closed. It is true that our style preference is not to have "thanks", but for something this minor, we'd just leave it. – gung - Reinstate Monica Mar 15 '19 at 13:48

6 Answers


Wouldn't the Wikileaks texts suit you?

adamo

What about wikinews? Here's the latest database dump I could find: http://dumps.wikimedia.org/enwikinews/20111120/

You probably want the "All pages, current versions only" version.
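If you go this route, the dump is MediaWiki export XML, which the Python standard library can parse. Here is a minimal sketch of pulling titles and wikitext out of such a file; the sample XML below is a made-up stand-in for the real, much larger dump.

```python
# Minimal sketch: extract (title, wikitext) pairs from a MediaWiki
# export XML string. Real dumps use a versioned XML namespace, so we
# match tags by suffix rather than by exact name.
import xml.etree.ElementTree as ET

def iter_pages(xml_text):
    """Yield (title, wikitext) pairs from a MediaWiki export XML string."""
    root = ET.fromstring(xml_text)
    for page in root.iter():
        if page.tag.endswith("page"):
            title = text = None
            for el in page.iter():
                if el.tag.endswith("title"):
                    title = el.text
                elif el.tag.endswith("text"):
                    text = el.text
            yield title, text

# Made-up sample mimicking the dump structure (not real wikinews data).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/">
  <page>
    <title>Example story</title>
    <revision><text>Some article text.</text></revision>
  </page>
</mediawiki>"""

for title, text in iter_pages(SAMPLE):
    print(title, "->", text)
```

For the real multi-gigabyte dump you would want `ET.iterparse` instead of `fromstring` to avoid loading the whole file into memory, but the element handling is the same.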

mogron

The Reuters text corpus is a classic in the field and can be found here.

richiemorrisroe
  • It's not the most interesting (or diverse) corpus. The license is also restrictive relative to Wikileaks (public domain US documents) or wikinews. – ariddell May 17 '13 at 14:57
  • @ariddell I agree, but it is commonly used in introductory NLP examples, and it's large enough to be useful for learning but small enough to be analysed on a good laptop. – richiemorrisroe May 28 '13 at 10:50

http://endb-consolidated.aihit.com/datasets.htm contains textual descriptions of roughly 10K companies.

Yuri

If recency is not an issue, you can try

http://www.infochimps.com/datasets/20-newsgroups-dataset-de-duped-version

and there are many more similar datasets on Infochimps, depending on your budget.
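The 20 newsgroups data is just raw Usenet messages: RFC 822-style headers, a blank line, then the body. That means the standard library's `email` module can split header metadata from text; the sample message below is made up for illustration.

```python
# Sketch: separate headers from body in a Usenet-style message,
# as found in the 20 newsgroups files. Sample message is invented.
from email import message_from_string

SAMPLE = (
    "From: someone@example.com\n"
    "Subject: Re: corpus question\n"
    "Newsgroups: comp.lang.misc\n"
    "\n"
    "This is the message body.\n"
)

msg = message_from_string(SAMPLE)
print(msg["Subject"])             # header lookup by name
print(msg.get_payload().strip())  # the body text
```

Stripping the headers this way is usually the first preprocessing step, since fields like `Newsgroups` otherwise leak the class label into a text classifier.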


drhanlau

If you want precomputed n-grams, you could try the Google Books Ngram datasets:

http://books.google.com/ngrams/datasets

tdc