
Block bad web crawlers

Dodgy web crawlers frequently hit web servers, resulting in many errors. Block them with an automatic return-to-sender method.

Futile crawler requests

Crawlers requesting non-existent pages

Examples of pointless, resource-hogging requests to a website are those made by crawlers asking for pages with a suffix (such as .ASP or .PHP) that does not exist on the site.

This behaviour is a clear indicator of a bad crawler in the server logs, and since such crawlers tend to fire off multiple calls for different pages in quick succession, they are easy to spot in a visual check of a web server log file.

Return to sender

Automated rebound in .htaccess

One way to manage such bad behaviour is to apply "RTS", i.e. bounce such calls back to the sender. It takes zero effort once in place and is far better than sending abuse emails to a hosting provider that does nothing.

If your web server supports a .htaccess file, this is very quick to activate: as seen in the example image, all page requests with a specific suffix get a stern rebound.
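
A minimal sketch of such a rule, assuming Apache 2.4 with mod_rewrite enabled and assuming the site serves no .PHP or .ASP pages at all (adjust the suffix list to whatever your own site never uses):

    <IfModule mod_rewrite.c>
        RewriteEngine On
        # Any request ending in a suffix this site never serves gets a 429.
        # A non-3xx code with the R flag makes Apache answer directly with
        # that status instead of redirecting; the "-" leaves the URL untouched
        # and NC makes the match case-insensitive.
        RewriteRule \.(php|asp|aspx|cgi)$ - [NC,R=429,L]
    </IfModule>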

429 - Too many requests

Note that each rebound is shipped off with a 429 status code (List of HTTP status codes); if any logic is available on the sending end, it might cause the crawler to slow down.

But in reality, any request that trips the rule will always be one too many, leaving the crawler to progressively increase its delay until the end of time.

It doesn't stop all bad crawlers, but it's a start in reducing the sheer tenacity they exhibit. Blocking on user agent, as sketched below, is another way to clean up crawlers if the directives in the robots.txt file are ignored.
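
A sketch of that approach, again assuming Apache with mod_rewrite and using made-up bot names as placeholders for the agents that ignore robots.txt:

    <IfModule mod_rewrite.c>
        RewriteEngine On
        # "BadBot" and "GreedyCrawler" are placeholder names; replace them
        # with the user agents seen misbehaving in your own logs.
        RewriteCond %{HTTP_USER_AGENT} (BadBot|GreedyCrawler) [NC]
        RewriteRule .* - [R=429,L]
    </IfModule>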

---