Questions tagged [web-crawlers]

A computer program that accesses web pages for various purposes (to scrape content, to provide search engines with information about your site, etc.)

906 questions
9 votes · 2 answers

Disqus thread migration. Gotchas?

I've been migrating a site to a new domain. The site itself is pretty straightforward (it uses Jekyll), and everything has gone fine -- except migration of Disqus threads. I've had partial success -- some of the threads have migrated successfully,…
sramsay · 91

7 votes · 2 answers

Is it possible for web crawlers to see static pages without following a link to them?

If I create a static page on a domain (http://www.domain.com/page.html), can a crawler still see it if there aren't any links to it anywhere on the site?
divided · 183

6 votes · 1 answer

List of all URI elements for a website

Does anyone know a good way of getting a list of all the URI elements for a website? I plan on moving a website to a new CMS and would like to setup some 301 redirects for articles, images, css, and js files that will be moved to a new location. I'm…
kwoodfriend · 171

5 votes · 1 answer

What is the User-Agent "AF_ID="?

We periodically (more frequently recently) have very aggressive crawling activity coming from EC2 instances that give us a user agent that looks like AF_ID=. I've looked around for common User-Agent formats and I cannot seem to find any…
rbieber · 121

5 votes · 2 answers

How is crawler seeing unlinked directories / files?

I'm running a crawler on my website to test for broken links and such. It starts by using a URL like www.domain.com One curious thing is that it is showing directories with no internal links. For example, directory /example_dir/ is showing up in the…
edeneye · 171

5 votes · 1 answer

Consequences of blocking Semrush and other bots?

For typical websites, are there any disadvantages to my clients to blocking spiders like Semrush, Maxmind and the plethora of other "non search-engine" bots? Wouldn't blocking these significantly reduce competitive analysis and provide a cheap…
davidgo · 7,904

4 votes · 2 answers

Sosospider: what does it actually want?

On one website I have, looking at the logs, I find lots and lots of "Sosospider" hits. Is this the same for everyone, or just me? Now, I have never once been sent any traffic from anything which looks like it might be anything to do with Sosospider,…
delete
3 votes · 4 answers

How can I make an AngularJS ecommerce website locally crawlable?

I'm trying to crawl our ecommerce website that runs on AngularJS in an effort to create a content inventory. I'm not really very good at coding. So far, I've tried DeepCrawl and Screaming Frog, but I'm unable to extract the data I need. It goes as…
Ellesa · 131

3 votes · 0 answers

What is a good open source web crawler?

I'm looking for a good open source web crawler and I found these: DataparkSearch, GNU Wget, GRUB, Heritrix, ht://Dig, HTTrack, ICDL, mnoGoSearch, Nutch, Open Search Server, PHP-Crawler, tkWWW Robot, Scrapy, Seeks, YaCy. But I cannot decide which is…
Heberfa · 131

3 votes · 1 answer

How To Slow Down A Generic Bot?

There is a generic bot (only identified as 'bot*') consuming most of my bandwidth and processing power. Blocking its IP stops it but since it comes from a well-known search-engine, I'd rather slow it down instead (it may be doing some useful…
Itai · 6,007
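For a situation like this one, a Crawl-delay directive in robots.txt is the usual first attempt. It only works for bots that actually honor robots.txt, and support varies: Bing and Yandex respect it, Google does not. The user-agent token below is illustrative, since the question only identifies the bot as 'bot*':

```
User-agent: bot
Crawl-delay: 30
```

The value is interpreted as seconds between requests by the engines that support it.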
3 votes · 3 answers

Blocking all search engines except the big ones

I would like to somehow be able to block all search engines except Google, Yahoo & Bing (and their related sites like Google Images) from crawling my site as they consume a lot of server and bandwidth but don't bring any traffic. Is this easily…
Craig · 151
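An allowlist robots.txt for this kind of setup is commonly written as per-crawler groups followed by a catch-all block. A sketch (effective only against bots that obey robots.txt; Slurp is Yahoo's crawler token):

```
# Allow the major crawlers everything
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: Slurp
Disallow:

# Block everyone else
User-agent: *
Disallow: /
```

Crawlers pick the most specific matching group, and an empty Disallow permits the whole site.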
3 votes · 2 answers

Prevent search engines indexing admin and other pages

My website has an Admin area which requires a login. There are also other pages that should not be indexed. I know that robots.txt entries, a meta name="robots" content="noindex" tag, or an X-Robots-Tag: noindex header are intended to inform the SE. There…
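The login already keeps crawlers out of the Admin area itself; for the other pages, an X-Robots-Tag response header can be set at the server level. A sketch for Apache with mod_headers enabled, using a hypothetical /private/ path (this belongs in the main server config, as <Location> blocks are not allowed in .htaccess):

```
<Location "/private/">
    Header set X-Robots-Tag "noindex, nofollow"
</Location>
```

Note that the affected pages must remain crawlable in robots.txt, or the crawler never fetches them and never sees the header.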
2 votes · 2 answers

What is a tolerable request rate for bots?

I'm writing an indexing crawler for my hobby search engine. What would be a safe figure for requests per second so I wouldn't be mistaken for a DoS attack and wouldn't get blocked by firewalls and such?
user81993 · 121
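Whatever figure a crawler author settles on (common advice is on the order of one request every few seconds per host, and to honor any Crawl-delay found in robots.txt), a per-host throttle is easy to sketch. The `Throttle` class, the default delay, and the host keys here are illustrative, not from the question:

```python
import time


class Throttle:
    """Enforce a minimum delay between successive requests to the same host."""

    def __init__(self, delay: float = 5.0):
        self.delay = delay  # seconds between hits to any one host
        self.last = {}      # host -> monotonic timestamp of the last request

    def wait(self, host: str) -> None:
        """Sleep just long enough to keep requests to `host` at the polite rate."""
        last = self.last.get(host)
        if last is not None:
            remaining = self.delay - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)
        self.last[host] = time.monotonic()
```

Keying the delay by host rather than globally lets the crawler stay fast overall while remaining gentle to each individual site.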
2 votes · 1 answer

How to write robots.txt for one hosting which has several websites in different directories?

I have several websites on one hosting account. The main website is in the root. Other websites are in sub-folders off the root directory. In the robots.txt file for the main website, do I need to Disallow other website directories?
user18787 · 93
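If the sub-folder sites are served under their own domain names, each domain answers its own /robots.txt request, so the main site's file need not mention them at all. If the sub-folders are also reachable through the main domain, the main robots.txt can exclude those paths; the directory names here are made up for illustration:

```
User-agent: *
Disallow: /site2/
Disallow: /site3/
```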
2 votes · 1 answer

Does a page blocked for search engines get indexed after link share (+1)

I always do my website development on a subdomain, and that subdomain is blocked for search engines. I do not want my lorem ipsum content and development domain name indexed by search engines. Let's say http://dev.mydomain.com. Now I am working with…
Saif Bechan · 1,590