I am looking for a large (>1000) text corpus to download, preferably one with world news or some kind of reports. So far I have only found one with patents. Any suggestions?
16
-
This thread appears to be off topic. See http://meta.stats.stackexchange.com/questions/1032/data-sourcing-we-need-to-make-up-our-mind/1033#comment2001_1033. – whuber Jun 07 '12 at 21:39
-
This question appears to be off-topic because it is about finding a data set, rather than doing statistical analysis – Peter Flom Nov 07 '13 at 13:12
-
Well that's awkward, because this Q&A is really useful. – Sideshow Bob Jan 07 '14 at 15:35
-
@guaka, please do not bump such old posts for such minor edits, especially a post that is closed. It is true that our style preference is not to have "thanks", but for something this minor, we'd just leave it. – gung - Reinstate Monica Mar 15 '19 at 13:48
6 Answers
6
What about Wikinews? Here's the latest database dump I could find: http://dumps.wikimedia.org/enwikinews/20111120/
You probably want the "All pages, current versions only" version.
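Once a dump like this is downloaded, the articles still have to be pulled out of the XML. Below is a minimal sketch of one way to do that with only the Python standard library, assuming the usual MediaWiki export layout (`<page>` elements containing a `<title>` and a `<revision>`/`<text>`). The function names, the tiny in-memory sample, and the idea of passing a local `.xml.bz2` path are my own illustrative assumptions, not something taken from the dump page.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> in a MediaWiki XML export stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to save memory


def iter_dump_pages(path):
    """Hypothetical helper: stream pages from a local bz2-compressed dump."""
    with bz2.open(path, "rb") as handle:
        yield from iter_pages(handle)


# Tiny hand-written sample in the assumed export layout (not real dump data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <page>
    <title>Example story</title>
    <revision><text>Body of the first article.</text></revision>
  </page>
  <page>
    <title>Another story</title>
    <revision><text>Body of the second article.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, "->", len(text), "characters")
```

Streaming with `iterparse` rather than loading the whole tree keeps memory use low on a full dump. Note that the text is raw wiki markup, so it will likely need further cleaning before analysis.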
mogron
-
The dump link no longer works. The dataset by region is small and outdated. – HappyCoding May 24 '16 at 08:38
6
The Reuters text corpus is a classic in the field, and can be found here
richiemorrisroe
-
It's not the most interesting (or diverse) corpus. The license is also restrictive relative to Wikileaks (public domain US documents) or wikinews. – ariddell May 17 '13 at 14:57
-
@ariddell I agree, but it is commonly used in introductory NLP examples, and it's large enough to be useful for learning but small enough to be analysed on a good laptop. – richiemorrisroe May 28 '13 at 10:50
3
http://endb-consolidated.aihit.com/datasets.htm contains 10K companies with textual descriptions
Yuri
1
If recency is not an issue, you can try
http://www.infochimps.com/datasets/20-newsgroups-dataset-de-duped-version
There are many more similar datasets on Infochimps, depending on your budget.
Regards, Andy.
drhanlau