[NTLUG:Discuss] robots.txt

Tue Mar 25 20:23:27 CST 2003

On Tuesday 25 March 2003 05:04 pm, David Ross wrote:
> I understand that this has to do with search engines, but
> what should a proper robots.txt file contain? And what
> file permissions should it have (444, 400??)

I don't think the exact file permissions matter -- as long 
as the user your web server runs as can read the file, and 
so can serve it, that's all that counts.
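
To make the 444-versus-400 question concrete, here is a 
small Python sketch that decodes which classes of user may 
read a file with a given mode. The helper name readable_by 
is made up for this illustration. Whether 400 is enough 
depends on an assumption the sketch cannot check: it only 
works if the web server runs as the file's owner.

```python
import stat


def readable_by(mode: int) -> dict:
    """Report which permission classes have read access for a Unix mode."""
    return {
        "owner": bool(mode & stat.S_IRUSR),
        "group": bool(mode & stat.S_IRGRP),
        "other": bool(mode & stat.S_IROTH),
    }


# Compare the two modes from the question.
for mode in (0o444, 0o400):
    print(oct(mode), readable_by(mode))
```

With 444 everyone can read the file, so the server can 
serve it no matter which user it runs as.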

Here's an example of what one looks like -- it's basically 
a list of URL paths you don't want the spiders to visit:

# robots.txt file for Anansi Site
User-agent: *
Disallow: /Store
Disallow: /Image
Disallow: /Grid
Disallow: /Button
Disallow: /Banner
Disallow: /Background
Disallow: /Legal
Disallow: /Forum
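
If you want to see how a cooperative spider would read 
these rules, Python's standard urllib.robotparser module 
can evaluate them. This is only a sketch: the agent name 
"ExampleBot" and the example.com URLs are placeholders, 
and real spiders may interpret edge cases differently. The 
rules are the same as in the example, written with the 
Disallow lines directly under the User-agent line, because 
some parsers (this one included) treat a blank line as the 
end of a record.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# robots.txt file for Anansi Site
User-agent: *
Disallow: /Store
Disallow: /Image
Disallow: /Grid
Disallow: /Button
Disallow: /Banner
Disallow: /Background
Disallow: /Legal
Disallow: /Forum
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Placeholder agent name and URLs, used only for illustration.
for url in (
    "http://example.com/index.html",
    "http://example.com/Store/cart",
    "http://example.com/Storefront",
):
    print(url, parser.can_fetch("ExampleBot", url))
```

Matching is by path prefix, so "Disallow: /Store" also 
covers a path such as /Storefront.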

There is of course one problem with this -- it relies on 
the spider being cooperative.  In my experience, only the 
major search engine spiders honor robots.txt -- you will 
still get spidered by the others.

For the ill-behaved spiders, there are such things as 
"spider traps" -- pages that take advantage of the way a 
spider searches a site, presenting an essentially endless 
loop of links that the spider will proceed down until it 
eventually gives up and goes away.  However, beyond knowing 
that they exist, I've never explored this option.
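
For what it's worth, the idea can be sketched in a few 
lines of Python. This is an untested guess at how such a 
trap might look, not a recipe from a real deployment: the 
/trap/ prefix, the function trap_page and the class 
TrapHandler are all invented here. Every page under the 
prefix links only to further generated pages under the 
same prefix, so a spider that ignores robots.txt never runs 
out of links. The sketch assumes you also list the prefix 
in a Disallow line so cooperative spiders stay out.

```python
import hashlib
import html
from http.server import BaseHTTPRequestHandler

TRAP_PREFIX = "/trap/"  # hypothetical prefix; also Disallow it in robots.txt


def trap_page(path: str, links: int = 3) -> str:
    """Return an HTML page whose links all lead one level deeper."""
    base = html.escape(path.rstrip("/"), quote=True)
    items = []
    for i in range(links):
        # Derive each link name from the current path, so every page
        # differs but the same URL always produces the same page.
        token = hashlib.sha256(f"{path}#{i}".encode("utf-8")).hexdigest()[:8]
        items.append(f'<li><a href="{base}/{token}">{token}</a></li>')
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>"


class TrapHandler(BaseHTTPRequestHandler):
    """Serve generated trap pages under TRAP_PREFIX, 404 elsewhere."""

    def do_GET(self):
        if not self.path.startswith(TRAP_PREFIX):
            self.send_error(404)
            return
        body = trap_page(self.path).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, one could pass TrapHandler to 
http.server.HTTPServer and call serve_forever().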

Another way to avoid excessive spidering is to make access 
to protected areas work only by an HTTP POST operation 
(i.e. a form submission).  Spiders generally do not fill 
out and submit web forms.
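
A minimal sketch of the POST-only idea, again in Python and 
again with made-up names (the /members/ prefix, status_for 
and PostOnlyHandler are assumptions for illustration): GET 
requests into the protected area are answered with 405 
Method Not Allowed, and only a form submission gets the 
content. It is not an access-control mechanism, just a way 
to keep link-following spiders out.

```python
from http.server import BaseHTTPRequestHandler

PROTECTED_PREFIX = "/members/"  # hypothetical protected area


def status_for(method: str, path: str) -> int:
    """Return the HTTP status to send for this method and path."""
    if path.startswith(PROTECTED_PREFIX) and method != "POST":
        return 405  # Method Not Allowed
    return 200


class PostOnlyHandler(BaseHTTPRequestHandler):
    """Serve the protected area only in response to POST requests."""

    def _reply(self, method: str) -> None:
        if status_for(method, self.path) == 405:
            self.send_response(405)
            self.send_header("Allow", "POST")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        body = b"<html><body>Page content goes here.</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        self._reply("GET")

    def do_POST(self):
        # Read and discard the submitted form data before replying.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self._reply("POST")
```

Pages outside the protected prefix still answer ordinary 
GET requests, so the rest of the site stays indexable.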

Cheers,
Terry

--
Terry Hancock ( hancock at anansispaceworks.com )
Anansi Spaceworks  http://www.anansispaceworks.com