Questions tagged [web-crawlers]

A computer program that accesses web pages for various purposes (to scrape content, to provide search engines with information about your site, etc.)

906 questions
9 votes · 2 answers

Disqus thread migration. Gotchas?

I've been migrating a site to a new domain. The site itself is pretty straightforward (it uses Jekyll), and everything has gone fine -- except migration of Disqus threads. I've had partial success -- some of the threads have migrated successfully,…
sramsay · 91

7 votes · 2 answers

Is it possible for web crawlers to see static pages without following a link to them?

If I create a static page on a domain (http://www.domain.com/page.html), can a crawler still see it if there aren't any links to it anywhere on the site?
divided · 183

6 votes · 1 answer

List of all URI elements for a website

Does anyone know a good way of getting a list of all the URI elements for a website? I plan on moving a website to a new CMS and would like to setup some 301 redirects for articles, images, css, and js files that will be moved to a new location. I'm…
kwoodfriend · 171

5 votes · 1 answer

What is the User-Agent "AF_ID="?

We periodically (more frequently recently) have very aggressive crawling activity coming from EC2 instances that give us a user agent that looks like AF_ID=. I've looked around for common User-Agent formats and I cannot seem to find any…
rbieber · 121

5 votes · 2 answers

How is crawler seeing unlinked directories / files?

I'm running a crawler on my website to test for broken links and such. It starts by using a URL like www.domain.com One curious thing is that it is showing directories with no internal links. For example, directory /example_dir/ is showing up in the…
edeneye · 171

5 votes · 1 answer

Consequences of blocking Semrush and other bots?

For typical websites, are there any disadvantages to my clients to blocking spiders like Semrush, Maxmind and the plethora of other "non search-engine" bots? Wouldn't blocking these significantly reduce competitive analysis and provide a cheap…
davidgo · 7,904

4 votes · 2 answers

Sosospider: what does it actually want?

On one website I have, looking at the logs, I find lots and lots of "Sosospider" hits. Is this the same for everyone, or just me? Now, I have never once been sent any traffic from anything which looks like it might be anything to do with Sosospider,…
delete
3 votes · 4 answers

How can I make an AngularJS ecommerce website locally crawlable?

I'm trying to crawl our ecommerce website that runs on AngularJS in an effort to create a content inventory. I'm not really very good at coding. So far, I've tried DeepCrawl and Screaming Frog, but I'm unable to extract the data I need. It goes as…
Ellesa · 131

3 votes · 0 answers

What is a good open source web crawler?

I'm looking for a good open source web crawler and I found these: DataparkSearch, GNU Wget, GRUB, Heritrix, ht://Dig, HTTrack, ICDL, mnoGoSearch, Nutch, Open Search Server, PHP-Crawler, tkWWW Robot, Scrapy, Seeks, YaCy. But I cannot decide which is…
Heberfa · 131

3 votes · 1 answer

How To Slow Down A Generic Bot?

There is a generic bot (only identified as 'bot*') consuming most of my bandwidth and processing power. Blocking its IP stops it but since it comes from a well-known search-engine, I'd rather slow it down instead (it may be doing some useful…
Itai · 6,007
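For a situation like this one, a Crawl-delay directive in robots.txt is the usual first attempt. It only works for bots that actually honor robots.txt, and support varies: Bing and Yandex respect it, Google does not. The user-agent token below is illustrative, since the question only identifies the bot as 'bot*':

```
User-agent: bot
Crawl-delay: 30
```

The value is interpreted as seconds between requests by the engines that support it.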
3 votes · 3 answers

Blocking all search engines except the big ones

I would like to somehow be able to block all search engines except Google, Yahoo & Bing (and their related sites like Google Images) from crawling my site as they consume a lot of server and bandwidth but don't bring any traffic. Is this easily…
Craig · 151
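An allowlist robots.txt for this kind of setup is commonly written as per-crawler groups followed by a catch-all block. A sketch (effective only against bots that obey robots.txt; Slurp is Yahoo's crawler token):

```
# Allow the major crawlers everything
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: Slurp
Disallow:

# Block everyone else
User-agent: *
Disallow: /
```

Crawlers pick the most specific matching group, and an empty Disallow permits the whole site.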
3 votes · 2 answers

Prevent search engines indexing admin and other pages

My website has an Admin area which requires a login. There are also other pages that should not be indexed. I know that robots.txt entries, a meta name="robots" content="noindex" tag, or an X-Robots-Tag: noindex header are intended to inform the SE. There…
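The login already keeps crawlers out of the Admin area itself; for the other pages, an X-Robots-Tag response header can be set at the server level. A sketch for Apache with mod_headers enabled, using a hypothetical /private/ path (this belongs in the main server config, as <Location> blocks are not allowed in .htaccess):

```
<Location "/private/">
    Header set X-Robots-Tag "noindex, nofollow"
</Location>
```

Note that the affected pages must remain crawlable in robots.txt, or the crawler never fetches them and never sees the header.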
2 votes · 2 answers

What is a tolerable request rate for bots?

I'm writing an indexing crawler for my hobby search engine. What would be a safe figure for requests per second so I wouldn't be mistaken for a DoS attack and wouldn't get blocked by firewalls and such?
user81993 · 121
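Whatever figure a crawler author settles on (common advice is on the order of one request every few seconds per host, and to honor any Crawl-delay found in robots.txt), a per-host throttle is easy to sketch. The `Throttle` class, the default delay, and the host keys here are illustrative, not from the question:

```python
import time


class Throttle:
    """Enforce a minimum delay between successive requests to the same host."""

    def __init__(self, delay: float = 5.0):
        self.delay = delay  # seconds between hits to any one host
        self.last = {}      # host -> monotonic timestamp of the last request

    def wait(self, host: str) -> None:
        """Sleep just long enough to keep requests to `host` at the polite rate."""
        last = self.last.get(host)
        if last is not None:
            remaining = self.delay - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)
        self.last[host] = time.monotonic()
```

Keying the delay by host rather than globally lets the crawler stay fast overall while remaining gentle to each individual site.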
2 votes · 1 answer

How to write robots.txt for one hosting which has several websites in different directories?

I have several websites on one hosting account. The main website is in the root. Other websites are in sub-folders off the root directory. In the robots.txt file for the main website, do I need to Disallow other website directories?
user18787 · 93
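If the sub-folder sites are served under their own domain names, each domain answers its own /robots.txt request, so the main site's file need not mention them at all. If the sub-folders are also reachable through the main domain, the main robots.txt can exclude those paths; the directory names here are made up for illustration:

```
User-agent: *
Disallow: /site2/
Disallow: /site3/
```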
2 votes · 1 answer

Does a page blocked for search engines get indexed after link share (+1)

I always do my website development on a subdomain, and that subdomain is blocked for search engines. I do not want my lorem ipsum content and development domain name indexed by search engines. Let's say http://dev.mydomain.com. Now I am working with…
Saif Bechan · 1,590