I am currently running a crawl of the internet using Nutch, which requires a list of URLs to start from as a seed. I currently have a seed list of 500k URLs.
I am looking for open data sources that provide good starting seeds for web crawls.
You can download the top 1 million (ZIP) sites from Alexa.
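A ranked site list still has to be turned into the plain-text seed file that Nutch injects (one URL per line). Here is a minimal sketch of that conversion, assuming the unzipped CSV has `rank,domain` rows. The function name `ranked_csv_to_seeds`, the `http://` scheme, and the sample domains are my own illustrative choices, not anything from Alexa or Nutch.

```python
import csv
import io


def ranked_csv_to_seeds(csv_text, limit=None, scheme="http"):
    """Convert 'rank,domain' rows into seed URLs, preserving rank order.

    Assumption: each row is 'rank,domain'. Adjust the column index if the
    file you downloaded is laid out differently.
    """
    seeds = []
    seen = set()
    for row in csv.reader(io.StringIO(csv_text)):
        # Stop once the requested number of seeds has been collected.
        if limit is not None and len(seeds) >= limit:
            break
        # Skip blank or malformed rows.
        if len(row) < 2:
            continue
        domain = row[1].strip().lower()
        # Skip empty domains and duplicates.
        if not domain or domain in seen:
            continue
        seen.add(domain)
        seeds.append(f"{scheme}://{domain}/")
    return seeds


# Hypothetical sample data standing in for the real file contents.
sample = "1,example.com\n2,example.org\n3,example.net\n"
print("\n".join(ranked_csv_to_seeds(sample, limit=2)))
```

Write the resulting lines to a text file in your seed directory (for example `urls/seed.txt`) and inject it as usual. The `limit` argument is there so you can start with a smaller slice of the list before committing to the full million.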
You may want to check out the Common Crawl dataset: http://commoncrawl.org/
It's hosted on Amazon S3 as HDFS with a 'pay as you access' model.
And that other question was mine. This question is different, as I am looking for a seed list and not a crawl dump.
– Dan Ciborowski - MSFT Jun 24 '13 at 19:21