Controlling Visits by Robots

Tip 3 : Robot Visits

Monitoring the activity of robots is an important part of web site administration. Robots used by search engines (e.g. Google) continually scan web sites to keep their indices up to date. Once search engines are visiting a web site regularly, you may want to control which areas of the site they visit. This is done with a file named robots.txt located in the root folder of the domain.

The file contains a set of directives that a robot should read before scanning a site; they state which pages are to be included in or excluded from such scans. You can use this facility to keep pages from being indexed by search engines. For example, you may have a set of test pages that you don't want to be seen yet, or you might want to keep images used on the site out of public search.

Example /robots.txt :


User-agent: *
Disallow: /newversion/
Disallow: /directory.htm

This excludes the /newversion/ folder and the specific file /directory.htm from robot visits. The rule applies to all robots; you can make it apply to specific agents if you wish.
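To check how a robot would interpret these rules, they can be fed to a parser. The following is a minimal sketch using Python's standard urllib.robotparser module; the robot name "ExampleBot" and the host www.example.com are hypothetical placeholders, not part of the original example.

```python
from urllib.robotparser import RobotFileParser

# The example rules, as they would appear in /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /newversion/
Disallow: /directory.htm
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" and www.example.com are hypothetical names for illustration.
for path in ("/index.htm", "/newversion/page.htm", "/directory.htm"):
    allowed = parser.can_fetch("ExampleBot", "http://www.example.com" + path)
    print(path, "allowed" if allowed else "blocked")
```

Against a live site, the same parser can fetch the file itself by calling set_url() with the address of the robots.txt file, followed by read().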

You can also indicate that you do not want robots to index a page by using a META tag within the page itself:


<meta name="robots" content="noindex">
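A robot that honours this tag has to look for it in the page it has just fetched. The sketch below shows one way to do that with Python's standard html.parser module; the class name RobotsMetaParser and the function may_index are hypothetical names chosen for illustration, and only the noindex value is checked.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute values may be None when a tag has a bare attribute.
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def may_index(html: str) -> bool:
    """Return False if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(may_index(page))
```

The content attribute may hold several comma-separated values, which is why the sketch splits it before testing for noindex.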

For further details please visit www.RobotsTxt.org.

Site Vigil takes notice of the robots.txt directives when it scans web sites.