Using Robots.txt to Control Search Engines



Introduction to robots.txt


Robots.txt is a plain text file placed in the root directory of your web site. It tells search engine robots and spiders which parts of the site they are allowed to crawl. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code.

 

How to create a robots.txt file


A site owner who wishes to give instructions to web robots must place a text file called robots.txt in the root of the web site hierarchy (e.g. www.website.com/robots.txt). You can create the robots.txt file manually, using any text editor. It should be an ASCII-encoded text file, not an HTML file, and the filename should be lowercase.
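As a rough illustration of the "root of the web site" rule, here is a minimal Python sketch that works out where a crawler would look for robots.txt, starting from any page address on the site. The function name robots_txt_url is made up for this example, and the sketch assumes the page URL includes a scheme such as http://.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt address for the host that serves page_url.

    Hypothetical helper for illustration: it keeps the scheme and host
    and replaces the rest of the URL with the fixed path /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # A deep page still maps to the single file at the site root.
    print(robots_txt_url("http://www.website.com/blog/2009/post.html?page=2"))
```

Whatever page you start from, the result is the same single address at the top of the site, which is why a robots.txt file saved in a subdirectory is never read.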

Place the robots.txt file in your server's root directory. This is standard web management practice. It must be in the main directory because otherwise user agents (search engine crawlers) will not be able to find it: they do not search the whole site for a file named robots.txt. Instead, they look only in the main directory, and if they don't find it there they simply assume that the site does not have a robots.txt file and crawl everything they find along the way. The file should contain its instructions in a specific format. The structure of a robots.txt file is pretty simple:

# Comments appear after the "#" symbol at the start of a line, or after a directive

# this example allows all robots to visit all files
User-agent: *
Disallow:

# exclude all robots from part of the server
User-agent: *
Disallow: /scripts/
Disallow: /images/
Disallow: /admin/

# Example that tells all crawlers not to fetch specific files
User-agent: *
Disallow: /dir/file.html
Disallow: /dir/file2.html

# allow google image bot to search all images
User-agent: Googlebot-Image
Allow: /*

# Block all images on your site from Google image search:
User-agent: Googlebot-Image
Disallow: /

# To remove a specific image from Google Images
User-agent: Googlebot-Image
Disallow: /images/image.jpg

# To remove a specific file type from Google Images (for example, .gif)
User-agent: Googlebot
Disallow: /*.gif$

# disallow WayBack archiving site
User-agent: ia_archiver
Disallow: /

# disallow all files with ? in url
User-agent: *
Disallow: /*?

# Sitemap
Sitemap: http://www.domain.com/sitemap.xml

All search engines, or at least all the important ones, now look for a robots.txt file as soon as their spiders reach your web site. So even if you currently do not need to exclude spiders from any part of your site, having a robots.txt file is still a good idea: it can act as a sort of invitation into your site.
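If you want to check how rules like the ones above are likely to be read before you upload the file, one option is Python's standard urllib.robotparser module. The sketch below is only an approximation of what a real crawler does: it feeds a small rule set to the parser and asks whether a given robot may fetch a given address. The RULES text and the build_parser helper are assumptions made for this example, and it sticks to plain path prefixes because the standard-library parser does not interpret the * and $ wildcards.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for illustration only (simple path prefixes, no wildcards).
RULES = """\
User-agent: *
Disallow: /scripts/
Disallow: /images/
Disallow: /admin/

User-agent: ia_archiver
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text held in a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(RULES)
    for agent, url in [
        ("Googlebot", "http://www.website.com/index.html"),
        ("Googlebot", "http://www.website.com/scripts/run.cgi"),
        ("ia_archiver", "http://www.website.com/index.html"),
    ]:
        verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
        print(agent, url, verdict)
```

Each search engine has its own parser, so treat the answers as a sanity check rather than a guarantee of how any particular crawler will behave.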


 


Robots.txt Optimization for WordPress

By specifying where search engines should look for content, in your high-quality directories or files, you can improve the ranking of your site; this is a practice recommended by Google and the other major search engines. An example WordPress robots.txt file:

User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /feed
Disallow: /comments
Disallow: /category/*/*
Disallow: */trackback
Disallow: */feed
Disallow: */comments
Disallow: /*?*
Disallow: /*?
Allow: /wp-content/uploads

# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*

# Google AdSense
User-agent: Mediapartners-Google*
Disallow:
Allow: /*

# Internet Archiver Wayback Machine
User-agent: ia_archiver
Disallow: /

# digg mirror
User-agent: duggmirror
Disallow: /

Sitemap: http://www.example.com/sitemap.xml
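Several rules in this file rely on the * and $ wildcards, an extension that Google and other major crawlers honor: * stands for any run of characters and a trailing $ means "the address ends here". The Python sketch below shows one way such a pattern could be turned into a regular expression to see which paths it covers. The helper names pattern_to_regex and path_matches are invented for this example, and the code only mimics the matching idea; it is not any search engine's actual implementation.

```python
import re


def pattern_to_regex(pattern):
    """Compile a robots.txt path pattern with * and $ wildcards.

    Hypothetical helper: '*' becomes '.*', a trailing '$' anchors the end,
    and every other character is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def path_matches(pattern, path):
    """True if the rule pattern covers the path (matched from the start)."""
    return pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    checks = [
        ("/wp-admin", "/wp-admin/index.php"),
        ("/category/*/*", "/category/news/2009"),
        ("/category/*/*", "/category/news"),
        ("*/feed", "/2009/12/my-post/feed"),
        ("/*?", "/index.php?p=12"),
        ("/*.gif$", "/images/logo.gif"),
    ]
    for pattern, path in checks:
        print(pattern, path, path_matches(pattern, path))
```

Trying a handful of your own addresses this way can show whether a broad rule such as /*? blocks more pages than you intended.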

 




by vojin
