Blocking Spiders!

on 20.07.2007 20:16:02 by BrentMcCulloch

Hi There,

Just wondering what other people are doing out there to block spiders that
aren't obeying the robots.txt file.

We have seen a bunch of different spiders that seem to completely ignore
robots.txt and go about indexing our entire site anyway! We don't want this!

Anyone have any ideas? Has anyone out there done this type of thing before?

Every time these spiders crawl parts of our site that we don't want them on,
the application log fills with warnings. So, I was thinking of rigging up a
temporary ban on the IPs of these spiders whenever the events start showing
up in the application log.

Not really sure how to do that though! :S

Any suggestions would be appreciated!!!!

Thanks a lot,
Brent.

Re: Blocking Spiders!

on 21.07.2007 00:52:41 by Deniz

Unfortunately there is not much you can do. An IP blacklist could be one
solution, but I am sure they use public proxies, so the addresses may keep
changing. Another thing you can try is to ban certain agent signatures: if
the offending agent is "ABC-bot v1.0", check the User-Agent header on every
page request and respond with a 404 if it is on your list of offenders.
However, those bots can easily change their agent name to that of a
legitimate browser at any time, so this is not a perfect solution.
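
A rough sketch of how the agent check and a temporary IP ban could fit
together, in Python only for illustration (I don't know what your site runs
on). Everything here is an assumption: the names BLOCKED_AGENT_SUBSTRINGS,
TempBanList and status_for_request are made up, the one-hour ban length is
arbitrary, and the 403 for an already-banned IP is my own choice. You would
call status_for_request from whatever hook your server gives you at the
start of each request.

```python
import time

# Hypothetical list of offending agent signatures (matched case-insensitively).
BLOCKED_AGENT_SUBSTRINGS = ("abc-bot",)


def is_blocked_agent(user_agent):
    """Return True if the User-Agent value contains a blocked signature."""
    ua = (user_agent or "").lower()
    return any(sig in ua for sig in BLOCKED_AGENT_SUBSTRINGS)


class TempBanList:
    """In-memory temporary ban list keyed by IP address (not persistent)."""

    def __init__(self, ban_seconds=3600, clock=time.monotonic):
        self.ban_seconds = ban_seconds  # assumed ban length: one hour
        self._clock = clock
        self._expires = {}  # ip -> clock value at which the ban ends

    def ban(self, ip):
        self._expires[ip] = self._clock() + self.ban_seconds

    def is_banned(self, ip):
        expiry = self._expires.get(ip)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            # Ban has run out, so forget the address.
            del self._expires[ip]
            return False
        return True


def status_for_request(ip, user_agent, bans):
    """Pick an HTTP status code for a request before doing any real work."""
    if bans.is_banned(ip):
        return 403  # assumption: refuse IPs that are currently banned
    if is_blocked_agent(user_agent):
        bans.ban(ip)  # start a temporary ban for this address
        return 404  # as suggested above for offending agents
    return 200
```

Keep in mind that banning by IP can also lock out legitimate visitors who
share a proxy with the bot, which is one reason to keep the ban temporary.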

If you are running web applications with lots of forms, put a CAPTCHA on
them so that at least your data stays clean (unprotected forms are also an
attraction for many of these bots).

Do they just crawl, or do they try other things (CGI attacks, form
submission, SQL injection, etc.)?

Deniz