
I have a website where I post CSV files as a free service. Recently I have noticed that wget and libwww have been scraping it pretty hard, and I was wondering how to curb that, even if only a little.

I have implemented a robots.txt policy, posted below:

User-agent: wget
Disallow: /

User-agent: libwww
Disallow: /

User-agent: *
Disallow: /  

Issuing a wget from my completely independent Ubuntu box confirms that wget against my server just doesn't seem to work:

wget http://myserver.com/file.csv

Anyway, I don't mind people just grabbing the info; I just want to implement some sort of flood control, like a wrapper or an interceptor.

Does anyone have any thoughts on this, or could you point me in the direction of a resource? I realize it might not even be possible. I'm just after some ideas.

Janie

Jane Wilkie
    I know how to block them through htaccess. I just don't know how to throttle them. If you want the snippet for blocking them, let me know and I'll post it. – John Conde Jun 29 '11 at 19:23
  • That's great John! I'll take what I can get for now. It would help. – Jane Wilkie Jun 29 '11 at 20:08

1 Answer


If you decide you want to block wget and libwww, you can either redirect them to a page telling them why they're being blocked with this code:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^libwww [NC]
RewriteRule ^(.*)$ http://www.example.com/blocked.html [R=302,L]

Or you can flat-out reject their requests with this code:

SetEnvIfNoCase User-Agent "^wget" bad_bot=1
SetEnvIfNoCase User-Agent "^libwww" bad_bot=1
<FilesMatch ".*">
  Order Allow,Deny
  Allow from all
  Deny from env=bad_bot
</FilesMatch>
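As a quick sanity check of the patterns above, here is a minimal Python sketch (the user-agent strings are illustrative) of the case-insensitive prefix matching that SetEnvIfNoCase performs:

```python
import re

# Case-insensitive prefix patterns, mirroring the SetEnvIfNoCase rules above
BAD_BOT_PATTERNS = [
    re.compile(r"^wget", re.IGNORECASE),
    re.compile(r"^libwww", re.IGNORECASE),
]

def is_bad_bot(user_agent):
    """Return True if the user agent starts with a blocked prefix."""
    return any(p.match(user_agent) for p in BAD_BOT_PATTERNS)

print(is_bad_bot("Wget/1.21.2"))       # True  (matches despite the capital W)
print(is_bad_bot("libwww-perl/6.67"))  # True
print(is_bad_bot("Mozilla/5.0"))       # False
```

Note that real wget identifies itself as `Wget/<version>`, which is why the match has to be case-insensitive and must not include a trailing space after `wget`.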

Just place either snippet in a .htaccess file in your document root, or in the directory the files are being downloaded from.

I have used the second snippet to block bots from a site I had that was getting scraped a lot. I haven't used the first snippet, but it looks like it should work well if you choose to go that route.
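Since the question also asks about flood control rather than outright blocking, here is a throttling sketch: assuming the third-party mod_evasive module is installed, something along these lines in the server config would temporarily deny clients that request pages too rapidly (the numbers are illustrative, not tuned):

```apache
<IfModule mod_evasive20.c>
    # Deny a client that requests the same page more than 5 times in a
    # 1-second interval, or the whole site more than 50 times in 1 second;
    # the resulting block lasts 10 seconds. Values are illustrative only.
    DOSPageCount        5
    DOSPageInterval     1
    DOSSiteCount        50
    DOSSiteInterval     1
    DOSBlockingPeriod   10
</IfModule>
```

Bandwidth limiting is another angle: Apache 2.4's bundled mod_ratelimit can cap per-connection transfer speed (`SetOutputFilter RATE_LIMIT` plus `SetEnv rate-limit`), which slows bulk CSV downloads without refusing anyone outright.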

John Conde