Possible Duplicate:
Where to find a large text corpus?

I know someone has asked a similar question here, but I'm wondering whether anyone knows of a large textual corpus that is available for research use. The number of documents it contains isn't terribly important--rather, I'm looking for something on the TB/PB scale that I can use to test the scalability of some algorithms. I thought about using the English Wikipedia data dump, but I think it's only about 25 GB. My other thought was using a database of Twitter messages, but, if I remember right, those aren't freely available. Does anyone have a recommendation?

    It really is the same question. – whuber Jan 04 '13 at 00:08
  • @whuber So then is the answer "no one knows"? My impression from that question was that the author was interested in the size (i.e., the n) of the data set, rather than the size (i.e., the physical space in memory), and was receiving answers along these lines. My interest is the latter of the two, and I think the approaches one would use for each are very different. – Kyle. Jan 04 '13 at 00:51
  • @whuber Did I not make this distinction clear enough in my question, or do you still think they're the same issue? – Kyle. Jan 04 '13 at 00:52
  • Aren't the size of the dataset and the size in RAM very closely related? Regardless, as I indicated long ago in a comment to that other question, this inquiry is--at best--only marginally on topic here. – whuber Jan 04 '13 at 14:28

0 Answers