|
You can provide multiple Disallow lines for one User-agent. In the
following example, all spiders are told not to crawl the cgi-bin
and images directories.
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
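One way to check how a rule set like this is read is to feed it to
a parser. The sketch below uses Python's standard urllib.robotparser
module; the spider name "AnySpider" and the sample paths are made up
for illustration, and a real search engine's parser may behave
differently from Python's.

```python
from urllib.robotparser import RobotFileParser

# The rule set from the example above: one User-agent, two Disallow lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
"""

def build_parser(text):
    """Parse a robots.txt body held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# "AnySpider" is a hypothetical crawler name; '*' applies to every agent.
for path in ("/cgi-bin/search.cgi", "/images/logo.png", "/index.html"):
    print(path, parser.can_fetch("AnySpider", path))
# /cgi-bin/search.cgi False
# /images/logo.png False
# /index.html True
```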
We can also use the robots.txt file to help improve the search
engine rankings of a site built from dynamic pages, such as PHP.
Googlebot may have problems with dynamic URLs if they carry too
many variables, such as session IDs.
A URL with session IDs will look similar to the one below:
http://www.yourcoolsite.com/cat.php?par=887&show=subcats?=0431Tr
If your cool website is written in PHP and has been converted into
HTML pages for Googlebot to index, the robot will still try to
index the PHP pages. After copying the pages from PHP to HTML, move
all the PHP pages into a folder of their own with a name that is
easy to remember, such as "php."
This lets you leave the HTML pages under the root directory, where
the spiders can index them easily. Then, using what you have
learned so far, add the following to your robots.txt file:
User-agent: googlebot
Disallow: /php/
Now we have kept Googlebot out of the PHP pages, which the bot
usually has problems crawling. The spider is left to crawl the
friendlier HTML pages, and it will not see your original content
duplicated between the PHP and HTML versions of your site. If the
pages are cleanly coded, this will often result in improved
rankings. Because the rule above names only googlebot, the other
major search engines will still crawl the PHP pages unless you add
matching rules for their spiders or use User-agent: * instead.
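The effect of the googlebot rule can be checked with a short script.
This is a minimal sketch using Python's standard urllib.robotparser
module; "OtherBot" and the file names are hypothetical, and Python's
matching of agent names is only an approximation of what a real
crawler does.

```python
from urllib.robotparser import RobotFileParser

# The rule from the example above: only googlebot is kept out of /php/.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /php/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, path):
    """Return True if the named agent may fetch the path under these rules."""
    return parser.can_fetch(agent, path)

# Hypothetical paths: a PHP page in the /php/ folder and its HTML copy at the root.
print(allowed("Googlebot", "/php/cat.php"))  # False: blocked by the rule
print(allowed("Googlebot", "/cat.html"))     # True: HTML pages stay crawlable
print(allowed("OtherBot", "/php/cat.php"))   # True: the rule names only googlebot
```

The last line shows that a spider with a different name is not
affected by a rule written for googlebot.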
You can also use comments in your robots.txt file, but you need to
be careful about where you put them.
Disallow: /images/ #comment send googlebot away
We could run into a problem if a search spider reads this line as
disallowing /images/#comment, which is not a folder on the server.
The rule would then not work as intended, and it might even tell
the bot to leave the website altogether.
It is better to put each comment on its own line. See the example
below.
#keeps googlebot out of my porn
User-agent: googlebot
Disallow: /images/
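How a comment is handled depends on the parser reading the file. As
an illustration only, the sketch below runs both comment placements
through Python's standard urllib.robotparser module, which strips
everything after a "#" and so treats the two forms the same. Other
spiders may not be as forgiving, which is why the separate line is
the safer habit. The comment wording and the photo.jpg path are
assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Comment on the same line as the rule.
INLINE = """\
User-agent: googlebot
Disallow: /images/ #comment send googlebot away
"""

# Comment on its own line, as recommended above.
SEPARATE = """\
#keeps googlebot out of the images folder
User-agent: googlebot
Disallow: /images/
"""

def blocks_images(text):
    """Return True if googlebot is kept out of /images/ by this rule set."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return not parser.can_fetch("googlebot", "/images/photo.jpg")

print(blocks_images(INLINE))    # True with Python's parser
print(blocks_images(SEPARATE))  # True
```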
So as we can see, there are valid and legitimate reasons to use
the robots.txt file, and there are numerous other occasions to use
it as well. In some cases it can stop a large company from looking
foolish for not protecting its intellectual property. In others it
keeps sensitive data from being crawled and indexed on the
internet, or helps a site improve its position in the organic
search results.
After you have written your robots.txt file and placed it on
your server, you should validate it with one of the robots.txt
validation tools online.
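If you want a quick local check before using an online tool, a few
common mistakes can be caught with a short script. The function
below, lint_robots, is a hypothetical helper written for this
article, not part of any validation tool, and its list of known
fields is an assumption. It only flags the problems discussed here:
inline comments, misspelled fields, rules placed before any
User-agent line, and paths that do not start with "/".

```python
# Field names this sketch accepts; an assumed list, not an official one.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots(text):
    """Return a list of (line_number, message) warnings for a robots.txt body."""
    warnings = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and whole-line comments are fine
        if "#" in line:
            warnings.append((number, "inline comment; move it to its own line"))
            line = line.split("#", 1)[0].strip()
        if ":" not in line:
            warnings.append((number, "missing ':' separator"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            warnings.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            seen_agent = True
        elif field in ("disallow", "allow"):
            if not seen_agent:
                warnings.append((number, "rule appears before any User-agent line"))
            if value and not value.startswith(("/", "*")):
                warnings.append((number, "path should start with '/'"))
    return warnings

good = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /images/\n"
bad = "Disallow: images/ #comment\nUser-agnet: *\n"

print(lint_robots(good))  # [] (no warnings)
for number, message in lint_robots(bad):
    print(number, message)
```

A clean result from a sketch like this does not replace a proper
validator; it only catches the slips described above.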
|