Open Seed for Crawl

Question

I am currently running a crawl of the internet using Nutch. This requires a list of URLs to use as a seed. I currently have a seed list of 500k URLs.

I am looking for open data sources that provide good starting seeds for web crawls.

What are your requirements for the seed? Are you interested in a specific area of the web? — Patrick Hoefler, Jun 24 '13 at 19:03
A specific topic would be "places".
And that other question was mine. This question is different as I am looking for a Seed list, and not a crawl dump. — Dan Ciborowski - MSFT, Jun 24 '13 at 19:21

score 4 · Accepted Answer · answered Jun 25 '13 at 08:54

You can download Alexa's list of the top 1 million sites (ZIP).
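
A ranked site list still has to be turned into URLs before Nutch can use it as a seed. The following is a minimal sketch of that step, assuming the downloaded file is a CSV with one `rank,domain` row per line (an assumption about the file layout, not something stated in this thread). The function name `ranked_csv_to_seeds` and its `limit` and `scheme` parameters are hypothetical and chosen only for illustration.

```python
import csv
import io


def ranked_csv_to_seeds(csv_text, limit=None, scheme="http"):
    """Turn 'rank,domain' CSV text into a list of seed URLs.

    Assumes each row is 'rank,domain'; rows without a domain are skipped.
    """
    seeds = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Stop once the requested number of seeds has been collected.
        if limit is not None and len(seeds) >= limit:
            break
        # Skip blank lines and rows that lack a domain column.
        if len(row) < 2:
            continue
        domain = row[1].strip().lower()
        if not domain:
            continue
        seeds.append(f"{scheme}://{domain}/")
    return seeds


if __name__ == "__main__":
    # Tiny inline sample standing in for the real download.
    sample = "1,google.com\n2,Example.org\n3,wikipedia.org\n"
    for url in ranked_csv_to_seeds(sample, limit=2):
        print(url)
```

Writing the resulting URLs one per line to a plain text file gives the format that Nutch reads as a seed list. Capping the list with `limit` is a convenient way to test the crawl on a small sample first.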

ramiro

2

Quantcast offers a free download of their top million US sites as well. It should also be noted that both downloads are very likely not Open Data. However, for seeding a web crawler, free (as in beer) should probably be good enough. – Patrick Hoefler Jun 25 '13 at 11:13

score 2 · Answer 2 · answered Jun 26 '13 at 23:12

You may want to check out the Common Crawl dataset: http://commoncrawl.org/

It's hosted on Amazon S3 as HDFS with a "pay as you access" model.

tmarthal

Thanks for the comment, but I am using the seed to create my own web crawl because Common Crawl's coverage is not great. – Dan Ciborowski - MSFT Jul 01 '13 at 13:52
