Updated: 06/13/06
The easiest way to create a properly formatted robots.txt file is to use a program like Robot-Manager * or Robot-Manager Pro *. This is what I use to create robots.txt files. I purchased the "Pro" edition so that I could analyze my log files to see which robots visit and what they fetch on their "crawl". The program helps identify potential problems the spiders may encounter. Being able to identify and correct search engine robot problems will pay for the program with one use. Robot Manager has a simple user interface that makes creating your robots.txt file a breeze.
The robots.txt file is a special file that tells search engine and specialty robots, spiders, and crawlers what pages and files they are allowed to index on your website (this is done by disallowing access to certain files and directories). This is a powerful tool, but special care must be taken to make sure it is created correctly.
This text file is placed in the root of your web server (e.g. public_html, www, htdocs or similar). My real robots.txt file can be viewed at http://webseodesign.com/robots.txt.
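Because the file always lives at the root, its address can be worked out from any page on the site. Here is a minimal Python sketch of that idea; the function name robots_txt_url and the sample page path are my own illustrations, not part of any standard tool.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path, query, and fragment are dropped
    # because robots.txt is only looked for at the root of the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page path, used only for illustration.
print(robots_txt_url("http://webseodesign.com/some/deep/page.html?x=1"))
```

A robots.txt file placed in a subdirectory would not be found this way, which is why the root location matters.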
Update: Google appears to be imposing a 5,000 character limit on robots.txt (thanks Barry for pointing that out).
Although most reputable search engines will respect your robots.txt file, remember that not all bots obey it (e.g. scraper bots and email address harvesters). Any truly sensitive information should be password protected, return HTTP status code 403 Forbidden, or both.
Robots.txt files must be edited in a text editor using "Unix" mode and uploaded via FTP in "ASCII" mode. This is necessary to preserve the "Unix line ender". Do not use a regular DOS text editor (like Windows Notepad) unless it has a "Unix" mode (note: though it appears that some bots will read a file with DOS "line enders", it is best not to take a chance). UltraEdit 32 * (free trial) and PSPad (freeware) both work well.
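If you are not sure which line enders a file has, a few lines of Python can check and convert it. This is only a sketch under my own assumptions: the helper names has_dos_line_endings and to_unix_line_endings are made up for this example.

```python
def has_dos_line_endings(data: bytes) -> bool:
    """Report whether the raw file bytes contain any carriage return."""
    return b"\r" in data


def to_unix_line_endings(text: str) -> str:
    """Convert DOS (CRLF) and stray CR line enders to Unix (LF)."""
    # Replace CRLF first so a CRLF pair does not turn into two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")


dos_text = "User-agent: *\r\nDisallow: /cgi-bin/\r\n"
unix_text = to_unix_line_endings(dos_text)
print(has_dos_line_endings(dos_text.encode("ascii")))
print(has_dos_line_endings(unix_text.encode("ascii")))
```

When saving the result from Python, opening the file with open(path, "w", newline="\n") keeps the line feeds from being translated back on Windows.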
The two basic building blocks of the file are the "User-agent" and the "Disallow" statements.
The "User-agent" field can have different values that are denoted by the bot name or a "wild card" character " * " (asterisk). Examples are:
# To specify the User-agent for Google:
User-agent: googlebot
# To specify the User-agent for all bots, use the "wild card" character:
User-agent: *
The "Disallow" statement specifies what the previously stated "User-agent" should not access: the entire site, certain directories and/or files. Examples are:
# To disallow access to a specific file in the root directory:
Disallow: /private-stuff.html
# To disallow access to an entire directory:
Disallow: /images/
# To disallow access to the entire site:
Disallow: /
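The "Disallow" values above behave like path prefixes: a bot that honors the file skips any path that starts with one of them. The following Python sketch shows that basic idea only. It is a simplification I wrote for illustration (the function is_disallowed is hypothetical) and it ignores wild cards and any engine-specific extensions.

```python
from typing import Iterable


def is_disallowed(path: str, disallow_rules: Iterable[str]) -> bool:
    """Return True if path starts with any non-empty Disallow value."""
    for rule in disallow_rules:
        # An empty Disallow value blocks nothing, so it is skipped.
        if rule and path.startswith(rule):
            return True
    return False


print(is_disallowed("/private-stuff.html", ["/private-stuff.html"]))
print(is_disallowed("/images/logo.gif", ["/images/"]))
print(is_disallowed("/index.html", ["/images/"]))
print(is_disallowed("/index.html", ["/"]))
```

Note that "Disallow: /" matches every path because every path begins with a slash, which is why it blocks the entire site.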
Here are a few simple examples:
# This allows all bots everywhere:
User-agent: *
Disallow:
# This disallows all bots everywhere:
User-agent: *
Disallow: /
# This blocks all bots from several directories and files:
User-agent: *
Disallow: /private-file.html
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /search.php
# This disallows access to engine specific pages:
User-agent: googlebot
Disallow: /index-msn.html

User-agent: msnbot
Disallow: /index-google.html
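One way to check how a set of rules reads to a bot is Python's standard urllib.robotparser module. The sketch below feeds it the engine specific example and asks what each bot may fetch. The bot name "otherbot" is made up for the test, and real search engine bots may interpret a file somewhat differently from Python's parser, so treat the result as a sanity check rather than proof.

```python
from urllib.robotparser import RobotFileParser

# The engine specific example, with a blank line separating the two records.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /index-msn.html

User-agent: msnbot
Disallow: /index-google.html
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory instead of fetching it from a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
for bot in ("googlebot", "msnbot", "otherbot"):
    for path in ("/index-google.html", "/index-msn.html"):
        verdict = "allowed" if parser.can_fetch(bot, path) else "blocked"
        print(bot, path, verdict)
```

A bot that is not named in any record, and for which no "User-agent: *" record exists, is treated as allowed everywhere.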
Comments are preceded by the " # " sign and should be on a line by themselves.
There are several other ways you can use wild cards " * ", partial file names and the ordering of the conditional statements to achieve just about anything. Remember to validate your file and monitor the bots' activity to verify compliance.
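Before uploading, a quick sanity check can catch the most common mistakes. The Python sketch below is a rough illustration only, not a replacement for a real validator: it assumes that only the two fields covered in this article ("User-agent" and "Disallow") are valid, and the function name lint_robots_txt is my own.

```python
from typing import List

# Assumption: only the two fields discussed in this article are accepted.
KNOWN_FIELDS = {"user-agent", "disallow"}


def lint_robots_txt(text: str) -> List[str]:
    """Return a list of human-readable problems found in robots.txt text."""
    problems = []
    in_record = False  # True once a User-agent line has started a record
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            in_record = False  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop any comment
        if not line:
            continue  # comment-only line
        field, sep, value = line.partition(":")
        field = field.strip().lower()
        if not sep:
            problems.append(f"line {number}: missing ':' separator")
        elif field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field '{field}'")
        elif field == "user-agent":
            if not value.strip():
                problems.append(f"line {number}: empty User-agent value")
            in_record = True
        elif not in_record:
            problems.append(f"line {number}: Disallow before any User-agent")
    return problems


sample = "# block the images\nUser-agent: *\nDisallow: /images/\n"
print(lint_robots_txt(sample))
```

An empty list means this simple check found nothing wrong; it is still worth running the file through one of the validators listed below.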
For additional information on robots.txt files:
http://www.robotstxt.org/wc/robots.html
http://www.searchengineworld.com/robots/robots_tutorial.htm
http://www.webmasterworld.com/forum93/
Validate your robots.txt files:
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
* Disclaimer: We have an affiliate relationship with some of the sites, products and services listed here. All such sites, products or services are denoted with *. We only recommend products because they are, in our opinion, one of the best available in their market. We use all the sites, products and services we recommend.