[NTLUG:Discuss] robots.txt
Terry Hancock
hancock at anansispaceworks.com
Tue Mar 25 20:23:27 CST 2003
On Tuesday 25 March 2003 05:04 pm, David Ross wrote:
> I understand that this has to do with search engines, but
> what should a proper robots.txt file contain? And what file
> permissions should it have (444, 400??)
I don't think the exact file permissions matter -- as long as
your web server can read and serve the file, that's all that
counts.
Here's an example of what one looks like -- it's basically
a list of paths you don't want the spiders to crawl:
# robots.txt file for Anansi Site
User-agent: *
Disallow: /Store
Disallow: /Image
Disallow: /Grid
Disallow: /Button
Disallow: /Banner
Disallow: /Background
Disallow: /Legal
Disallow: /Forum
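
To see how a cooperative spider would read rules like these, here is a minimal sketch using Python's standard urllib.robotparser module. The rule list is abbreviated from the example above, and the agent name "ExampleBot", the helper is_allowed, and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules abbreviated from the example robots.txt above.
RULES = """\
User-agent: *
Disallow: /Store
Disallow: /Image
Disallow: /Forum
"""


def is_allowed(path, agent="ExampleBot"):
    """Return True if a cooperative spider may fetch `path`.

    "ExampleBot" is a hypothetical agent name; it falls under the
    "User-agent: *" record because no record names it specifically.
    """
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Hypothetical paths, chosen only to exercise the rules.
    for path in ("/Store/cart", "/index.html", "/Forum"):
        print(path, is_allowed(path))
```

Note that this parser treats each Disallow value as a path prefix, so "Disallow: /Image" also covers a path such as /Images/logo.png.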
There is of course one problem with this -- it relies on
the spider being cooperative. In my experience, only the
major search engine spiders honor it -- the rest will
spider you anyway.
For the ill-behaved spiders, there are such things as
"spider traps" -- which take advantage of search mechanisms
to produce essentially endless loops of links that the
spider follows until it eventually gives up and goes away.
However, beyond knowing that they exist, I've never
explored this option.
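
Purely as an illustration of the general idea, and not a tested or recommended setup, a trap can be sketched as a page generator in which every page links one level deeper, so the chain of links never ends. The "/trap/" prefix and the function name trap_page are assumptions made up for this sketch; such a path would presumably also be listed under Disallow so that cooperative spiders never enter it.

```python
import hashlib

# Hypothetical URL prefix for the trap; not taken from any real site.
TRAP_PREFIX = "/trap/"


def trap_page(path):
    """Return (html, next_path) for a page that links one level deeper.

    The next link is derived from a hash of the current path, so every
    page in the chain has a distinct URL and the chain never repeats.
    """
    token = hashlib.sha1(path.encode("utf-8")).hexdigest()[:8]
    next_path = path.rstrip("/") + "/" + token
    html = '<html><body><a href="%s">more</a></body></html>' % next_path
    return html, next_path


if __name__ == "__main__":
    # Follow the chain a few steps, the way a spider ignoring
    # robots.txt would.
    path = TRAP_PREFIX
    for _ in range(3):
        html, path = trap_page(path)
        print(path)
```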
Another way to avoid excessive spidering is to make access
to protected areas work only by an HTTP POST operation
(i.e. a form submission). Spiders generally do not fill
out and submit web forms.
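
A minimal sketch of the POST-only idea, written as a plain WSGI application so it runs without any framework. The "/Store" prefix, the names app and call, and the response text are assumptions for illustration, not how any particular site implements it.

```python
# Hypothetical protected area; any path under it requires a POST.
PROTECTED_PREFIX = "/Store"


def app(environ, start_response):
    """WSGI app that refuses non-POST requests to the protected area."""
    method = environ.get("REQUEST_METHOD", "GET")
    path = environ.get("PATH_INFO", "/")
    if path.startswith(PROTECTED_PREFIX) and method != "POST":
        # A spider following plain links (GET) ends up here.
        start_response("405 Method Not Allowed",
                       [("Content-Type", "text/plain"), ("Allow", "POST")])
        return [b"Use the form to reach this area.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"OK\n"]


def call(method, path):
    """Invoke the app directly and return (status, body) for inspection."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {"REQUEST_METHOD": method, "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    print(call("GET", "/Store/cart")[0])
    print(call("POST", "/Store/cart")[0])
```

This only keeps out clients that never submit forms; it is not access control, since anything able to send a POST request still gets through.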
Cheers,
Terry
--
Terry Hancock ( hancock at anansispaceworks.com )
Anansi Spaceworks http://www.anansispaceworks.com