
I am looking for a large (>1000 documents) text corpus to download, preferably containing world news or reports of some kind. So far I have only found one with patents. Any suggestions?

the
  • This thread appears to be off topic. See http://meta.stats.stackexchange.com/questions/1032/data-sourcing-we-need-to-make-up-our-mind/1033#comment2001_1033. – whuber Jun 07 '12 at 21:39
  • This question appears to be off-topic because it is about finding a data set, rather than doing statistical analysis – Peter Flom Nov 07 '13 at 13:12
    Well that's awkward, because this Q&A is really useful. – Sideshow Bob Jan 07 '14 at 15:35
  • @guaka, please do not bump such old posts for such minor edits, especially a post that is closed. It is true that our style preference is not to have "thanks", but for something this minor, we'd just leave it. – gung - Reinstate Monica Mar 15 '19 at 13:48

6 Answers


Wouldn't the Wikileaks texts suit you?

adamo

What about wikinews? Here's the latest database dump I could find: http://dumps.wikimedia.org/enwikinews/20111120/

You probably want the "All pages, current versions only" version.
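If you go this route, the dump is MediaWiki export XML, which the Python standard library can parse. Here is a minimal sketch of pulling titles and wikitext out of such a file; the sample XML below is a made-up stand-in for the real, much larger dump.

```python
# Minimal sketch: extract (title, wikitext) pairs from a MediaWiki
# export XML string. Real dumps use a versioned XML namespace, so we
# match tags by suffix rather than by exact name.
import xml.etree.ElementTree as ET

def iter_pages(xml_text):
    """Yield (title, wikitext) pairs from a MediaWiki export XML string."""
    root = ET.fromstring(xml_text)
    for page in root.iter():
        if page.tag.endswith("page"):
            title = text = None
            for el in page.iter():
                if el.tag.endswith("title"):
                    title = el.text
                elif el.tag.endswith("text"):
                    text = el.text
            yield title, text

# Made-up sample mimicking the dump structure (not real wikinews data).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/">
  <page>
    <title>Example story</title>
    <revision><text>Some article text.</text></revision>
  </page>
</mediawiki>"""

for title, text in iter_pages(SAMPLE):
    print(title, "->", text)
```

For the real multi-gigabyte dump you would want `ET.iterparse` instead of `fromstring` to avoid loading the whole file into memory, but the element handling is the same.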

mogron

The Reuters text corpus is a classic in the field and can be found here.

richiemorrisroe
  • It's not the most interesting (or diverse) corpus. The license is also restrictive relative to Wikileaks (public domain US documents) or wikinews. – ariddell May 17 '13 at 14:57
  • @ariddell I agree, but it is commonly used in introductory NLP examples, and it's large enough to be useful for learning but small enough to be analysed on a good laptop. – richiemorrisroe May 28 '13 at 10:50

http://endb-consolidated.aihit.com/datasets.htm contains textual descriptions of roughly 10K companies.

Yuri

If recency is not an issue, you can try

http://www.infochimps.com/datasets/20-newsgroups-dataset-de-duped-version

and there are many more similar datasets on Infochimps, depending on your budget.
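The 20 newsgroups data is just raw Usenet messages: RFC 822-style headers, a blank line, then the body. That means the standard library's `email` module can split header metadata from text; the sample message below is made up for illustration.

```python
# Sketch: separate headers from body in a Usenet-style message,
# as found in the 20 newsgroups files. Sample message is invented.
from email import message_from_string

SAMPLE = (
    "From: someone@example.com\n"
    "Subject: Re: corpus question\n"
    "Newsgroups: comp.lang.misc\n"
    "\n"
    "This is the message body.\n"
)

msg = message_from_string(SAMPLE)
print(msg["Subject"])             # header lookup by name
print(msg.get_payload().strip())  # the body text
```

Stripping the headers this way is usually the first preprocessing step, since fields like `Newsgroups` otherwise leak the class label into a text classifier.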


drhanlau

If you want precomputed n-grams, you could try the Google Books Ngram datasets:

http://books.google.com/ngrams/datasets

tdc