Hunting down annoying web spiders

We all enjoy having Googlebot and other search engine robots index our sites, since it helps our search rankings, but it's annoying when someone scrapes your site for their own benefit. Forum sites are especially frequent targets, and the extra load can severely impact server performance.

To hunt down these connections while the spidering is happening, run this command:

netstat -plan | grep :80 | awk '{print $5}' | sed 's/:.*$//' | sort | uniq -c | sort -rn

The IP addresses making the most connections will appear at the top of the list. From there, you can look up those addresses (in your web server's access log, for example) to find out which unwelcome spider is scraping your site.
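
One caveat with the one-liner: grep :80 also matches ports such as 8080 and any line where the remote end is on port 80, and the sed expression cuts at the first colon, which mangles IPv6 addresses. Here is a stricter variation, offered as a rough sketch rather than a drop-in replacement. The top_talkers function name is made up for this example, and it assumes the usual Linux netstat -ant column layout (local address in column 4, remote address in column 5, state in column 6).

```shell
# Hypothetical helper (name invented for this example): count connections
# per remote IP for one local port. It reads netstat-style output on stdin,
# so it also works on output saved to a file.
# Assumes columns: proto recv-q send-q local-addr remote-addr state
top_talkers() {
    port="${1:-80}"
    awk -v port="$port" '
        # Keep lines whose LOCAL address ends in :port, skipping listeners
        $4 ~ (":" port "$") && $6 != "LISTEN" {
            ip = $5
            sub(/:[0-9]+$/, "", ip)   # strip only the trailing :port
            count[ip]++
        }
        END { for (ip in count) print count[ip], ip }
    ' | sort -rn
}

# Usage: netstat -ant | top_talkers 80 | head
```

Matching on the local address column keeps outbound connections from your own server out of the tally, and stripping only the trailing port leaves IPv6 addresses readable.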

Printed from: http://rackerhacker.com/2007/09/08/hunting-down-annoying-web-spiders/ .
© Major Hayden 2012.
