
It’s a coincidence that I should come across that “evil robot” article about the same time I was looking into my web logs…
I haven’t complained in a while about bad bots, but it turns out that a few of them have been hammering my blogs on a regular basis, increasing server memory usage and eating up my monthly bandwidth allotment.
My blogs aren’t high-traffic sites, so we’re not talking a couple of gigs or even hundreds of megabytes, but it does add up at the end of each month. It’ll only get worse if left unchecked as more and more content is added.
Oopsy with the IP Bans
I had taken steps to block the biggest offenders and the networks from which they operate, but I used the wrong setting for Apache’s mod_access Order directive. I used “Order deny,allow” and put my “deny from” lines before my “allow from all” line.
What that did was have Apache evaluate all the deny lines first and the allow lines last, with a matching allow overriding any deny. Where the lines sit in the file makes no difference; the Order directive decides. Since “allow from all” matches every visitor, it let everyone in.
Basically, all of my IP bans did nothing.
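To make the mix-up concrete, here is a rough Python sketch of how the two Order settings behave. It is a simplified model of the documented mod_access semantics, not Apache's actual code, and the function names and addresses (taken from the reserved documentation ranges) are made up for illustration; they are not my real ban list.

```python
import ipaddress


def matches(client_ip, rules):
    """Return True if client_ip matches any rule ("all", a single IP, or a CIDR network)."""
    ip = ipaddress.ip_address(client_ip)
    for rule in rules:
        if rule == "all" or ip in ipaddress.ip_network(rule, strict=False):
            return True
    return False


def is_allowed(client_ip, order, allow, deny):
    """Simplified model of how the Order directive combines Allow and Deny rules."""
    if order == "deny,allow":
        # Deny rules are checked first and Allow rules last; a matching Allow
        # overrides a Deny, and clients matching neither list are let in.
        return matches(client_ip, allow) or not matches(client_ip, deny)
    if order == "allow,deny":
        # Allow rules are checked first and Deny rules last; a client must
        # match an Allow rule and must not match any Deny rule.
        return matches(client_ip, allow) and not matches(client_ip, deny)
    raise ValueError("order must be 'deny,allow' or 'allow,deny'")


# Hypothetical rules: ban one network, allow everyone else.
ALLOW = ["all"]
DENY = ["192.0.2.0/24"]

bad_bot = "192.0.2.10"      # inside the banned network
visitor = "198.51.100.7"    # an ordinary reader

print(is_allowed(bad_bot, "deny,allow", ALLOW, DENY))   # True: the ban does nothing
print(is_allowed(bad_bot, "allow,deny", ALLOW, DENY))   # False: the ban works
print(is_allowed(visitor, "allow,deny", ALLOW, DENY))   # True: readers still get in
```

With “allow from all” in the mix, the deny,allow setting can never block anyone, which is exactly what happened to me. Under allow,deny the same rules let regular visitors in and keep the banned network out.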
Lower Load Averages
After I fixed the Order fiasco, I was surprised to see a noticeable effect. The load averages (roughly, the average number of processes waiting in the system run queue) for my blogs dropped significantly. My load averages had ranged from 3 to almost 5 for each of my blogs. With the IP bans working, the load averages have all been under 2 for everything; the average is about 1.5.
I’m watching two ill-behaved bots (Yandex and cuil.com’s Twiceler) repeatedly smack their heads into the IP ban wall right at this very moment as I type this article and load averages still remain nice and low.
Benefit and Behaviour
I don’t mind if a bot crawls my site provided there is some benefit to me. For instance, Google’s bot eats up a few megabytes but it’s compliant and respects robots.txt. I get a kick-back in terms of traffic by allowing Google to index my blogs.
But there are tonnes of crawlers out there from other companies or entities that are simply parasitic. These companies profit off the site data and content they crawl and collect. They eat up site bandwidth and increase server load, don’t respect the site owner’s wishes (i.e. they don’t obey robots.txt) and give no tangible returns to the sites they crawl.
It’s a sad statement about how the web is used when even a small-time, nobody blogger like me has to put up with this kind of crap. It does give me a little bit of insight into the kind of bullshit that the big web sites have to deal with on a regular basis.
It ain’t pretty!