|
The first line is the User-agent line. This is the line where
you specify which search spiders the rules that follow apply to.
The second line is the directive line, or Disallow field.
This is the line you use to block spiders from folders or
files.
Of particular note: if the publishers of Perfect 10 magazine (an
online porn magazine suing Google for linking to their images) had
used a robots.txt file, they could have stopped search spiders from
indexing their images. Is it Google's fault the magazine hired
incompetent IT staff? I don't think so. To me it's another adult
webmaster looking for more free publicity.
To write the robots.txt file, you would start by addressing
specific search engines. The User-agent line would start as:
User-agent:
Adding a specific search engine's spider name here tells that
spider that it is to follow the next line for instructions,
e.g.:
User-agent: googlebot
This tells googlebot that it is to follow the next line's
directions on how to proceed through your website, or to leave
altogether. You may also use an asterisk (*) as a wildcard that
matches all search spiders.
The second line, known as the directive, is written as:
Disallow:
When you add a folder after the Disallow statement, a compliant
search spider should skip that folder for indexing purposes and move
on to others where there is no restriction.
Disallow: /images/
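To see how a well-behaved spider might read these two lines, here is a minimal sketch using Python's standard urllib.robotparser module. The file paths and the spider name "otherbot" are made up for illustration, and real search engines use their own parsers, so treat this as an approximation rather than a guarantee of how any particular spider behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: a single group addressed to googlebot.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /images/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# googlebot is named in the group, so /images/ is off limits to it.
print(parser.can_fetch("googlebot", "/images/photo.jpg"))  # False
# Folders without a Disallow rule stay open.
print(parser.can_fetch("googlebot", "/articles/seo.htm"))  # True
# A spider that is not named (and has no * group) is not restricted.
print(parser.can_fetch("otherbot", "/images/photo.jpg"))   # True
```

The last check is worth noticing: a rule written for one spider does nothing to the others, which is where the asterisk wildcard comes in.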
This is a special example, just for Perfect 10. This one-minute bit
of instruction could have saved a ton in wasted legal fees on a
frivolous lawsuit. As this is a basic step in building websites, it
is incumbent on website owners to protect their intellectual
property; it is not a third-party search engine's duty.
You can also disallow specific files this way (note that the path
starts with a forward slash):
Disallow: /cheeseyporn.htm
One use I recommend every time is keeping robots out of your
cgi-bin directory:
Disallow: /cgi-bin/
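A file rule and a folder rule can sit together under one User-agent line. The sketch below, again using Python's urllib.robotparser, assumes both paths are written with a leading slash and are placed under the * wildcard; the script name form.pl, the page index.htm, and the spider name "anybot" are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one file and one folder blocked for every spider.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cheeseyporn.htm
Disallow: /cgi-bin/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Paths a spider might request, checked against the rules above.
paths = ["/cheeseyporn.htm", "/cgi-bin/form.pl", "/index.htm"]
results = {path: rules.can_fetch("anybot", path) for path in paths}

for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Rules are matched as path prefixes in this parser, so everything inside /cgi-bin/ is covered by the single folder line.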
If you leave the Disallow directive line blank, this indicates
that ALL files may be retrieved and indexed by the specified
robot(s). The following would let all robots index all files:
User-agent: *
Disallow:
And vice versa, you can keep all robots out just as easily:
User-agent: *
Disallow: /
In the example above, the single forward slash (/) stands for your
root directory. Since the root directory is blocked, none of the
other folders and files can be crawled. Compliant search engines
will stop crawling your site once they read your robots.txt, and
your pages should drop out of their results as they update their
indexes.
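The difference between a blank Disallow and a single slash is easy to get backwards, so here is a small side-by-side check, sketched with Python's urllib.robotparser. The helper name is_allowed and the sample paths are assumptions for illustration, and the result reflects this parser's reading of the rules, not any particular search engine's.

```python
from urllib.robotparser import RobotFileParser

# Blank Disallow: nothing is blocked.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# Disallow of the root directory: everything is blocked.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


for path in ("/", "/index.htm", "/images/photo.jpg"):
    open_site = is_allowed(ALLOW_ALL, "googlebot", path)
    closed_site = is_allowed(BLOCK_ALL, "googlebot", path)
    print(f"{path}: blank Disallow -> {open_site}, Disallow: / -> {closed_site}")
```

Running a check like this before uploading a robots.txt is a cheap way to avoid shutting every spider out of a site by accident.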
|