3

I am currently crawling the web with Nutch, which needs a list of seed URLs to start from. My current seed list has 500k URLs.

I am looking for open data sources that provide good starting seeds for web crawls.

Patrick Hoefler

2 Answers

4

You can download the list of the top 1 million sites from Alexa (as a ZIP file).
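
A ranked site list still has to be turned into URLs before Nutch can use it, since a Nutch seed file is plain text with one URL per line. Below is a minimal sketch of that conversion. It assumes the unzipped file is a CSV of `rank,domain` rows; the function name `domains_to_seed_urls`, the `http` scheme, and the sample rows are illustrative only and not taken from the download itself.

```python
import csv
import io


def domains_to_seed_urls(csv_text, limit=None, scheme="http"):
    """Convert 'rank,domain' CSV rows into a list of seed URLs.

    The 'rank,domain' layout is an assumption about the downloaded file.
    """
    urls = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank or malformed rows that lack a domain column.
        if len(row) < 2:
            continue
        domain = row[1].strip()
        if not domain:
            continue
        urls.append(f"{scheme}://{domain}/")
        # Stop early once the requested number of seeds is collected.
        if limit is not None and len(urls) >= limit:
            break
    return urls


# Illustrative rows only; real input would be the unzipped CSV contents.
sample = "1,google.com\n2,facebook.com\n3,youtube.com\n"
seeds = domains_to_seed_urls(sample, limit=2)
print("\n".join(seeds))
```

Writing the resulting lines to a text file in the seed directory gives Nutch a list it can inject. The `limit` parameter lets you start with a small slice of the list before committing to the full crawl.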

ramiro
2

You may want to check out the Common Crawl dataset: http://commoncrawl.org/

It's hosted on Amazon S3 as HDFS with a 'pay as you access' model.

tmarthal