Sunday, July 1, 2012

How to Create Robots.txt and Use Wildcard and Dollar Pattern Matching


How to Create Robots.txt


The simplest robots.txt file uses two rules:
  • User-agent: the robot the following rule applies to
  • Disallow: the URL you want to block
These two lines are considered a single entry in the file. You can include as many entries as you want. You can include multiple Disallow lines and multiple user-agents in one entry.
Each section in the robots.txt file is separate and does not build upon previous sections. For example:
User-agent: *
Disallow: /folder1/

User-agent: Googlebot
Disallow: /folder2/
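In this example, only URLs matching /folder2/ would be disallowed for Googlebot. Googlebot follows the section that names it and ignores the User-agent: * section, so /folder1/ stays crawlable for Googlebot.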
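To see this independence of sections in practice, here is a minimal sketch using Python's standard-library urllib.robotparser. The bot name OtherBot and the page URLs are made up for illustration, and this parser only approximates how a real search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# The two-section example from above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /folder1/

User-agent: Googlebot
Disallow: /folder2/
"""

section_parser = RobotFileParser()
section_parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page URLs, one in each folder.
FOLDER1_PAGE = "http://www.example.com/folder1/page.html"
FOLDER2_PAGE = "http://www.example.com/folder2/page.html"

# Googlebot is governed only by the section that names it.
googlebot_folder1 = section_parser.can_fetch("Googlebot", FOLDER1_PAGE)
googlebot_folder2 = section_parser.can_fetch("Googlebot", FOLDER2_PAGE)

# A bot with no section of its own falls back to the User-agent: * section.
otherbot_folder1 = section_parser.can_fetch("OtherBot", FOLDER1_PAGE)
otherbot_folder2 = section_parser.can_fetch("OtherBot", FOLDER2_PAGE)

print("Googlebot /folder1/:", googlebot_folder1)
print("Googlebot /folder2/:", googlebot_folder2)
print("OtherBot  /folder1/:", otherbot_folder1)
print("OtherBot  /folder2/:", otherbot_folder2)
```

Googlebot can fetch the /folder1/ page but not the /folder2/ page, while the unnamed bot sees the opposite result.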

User-agents and bots

A user-agent is a specific search engine robot. The Web Robots Database lists many common bots. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk). An entry that applies to all bots looks like this:
User-agent: *  
Google uses several different bots (user-agents). The bot we use for our web search is Googlebot. Our other bots like Googlebot-Mobile and Googlebot-Image follow rules you set up for Googlebot, but you can set up specific rules for these specific bots as well.

Blocking user-agents

The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).
  • To block the entire site, use a forward slash.
    Disallow: /
  • To block a directory and everything in it, follow the directory name with a forward slash.
    Disallow: /junk-directory/
  • To block a page, list the page.
    Disallow: /private_file.html
  • To remove a specific image from Google Images, add the following:
    User-agent: Googlebot-Image
    Disallow: /images/dogs.jpg
  • To remove all images on your site from Google Images:
    User-agent: Googlebot-Image
    Disallow: /
  • To block files of a specific file type (for example, .gif), use the following:
    User-agent: Googlebot
    Disallow: /*.gif$
  • To prevent pages on your site from being crawled, while still displaying AdSense ads on those pages, disallow all bots other than Mediapartners-Google. This keeps the pages from appearing in search results, but allows the Mediapartners-Google robot to analyze the pages to determine the ads to show. The Mediapartners-Google robot doesn't share pages with the other Google user-agents. For example:
    User-agent: *
    Disallow: /

    User-agent: Mediapartners-Google
    Allow: /
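The plain (non-wildcard) rules in this list can be tried out with Python's standard-library urllib.robotparser. This is only a rough sketch: the helper name can_crawl and the page URLs are hypothetical, and the standard-library parser does not understand the * and $ patterns, so the .gif rule is left out.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_txt, user_agent, url):
    """Hypothetical helper: parse robots.txt text and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Block one directory and one page for every bot.
BLOCK_RULES = """\
User-agent: *
Disallow: /junk-directory/
Disallow: /private_file.html
"""

# The AdSense setup: everything blocked except for Mediapartners-Google.
ADSENSE_RULES = """\
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /
"""

SITE = "http://www.example.com"  # placeholder domain

directory_blocked = not can_crawl(BLOCK_RULES, "Googlebot", SITE + "/junk-directory/a.html")
page_blocked = not can_crawl(BLOCK_RULES, "Googlebot", SITE + "/private_file.html")
other_page_open = can_crawl(BLOCK_RULES, "Googlebot", SITE + "/index.html")

adsense_open = can_crawl(ADSENSE_RULES, "Mediapartners-Google", SITE + "/index.html")
search_blocked = not can_crawl(ADSENSE_RULES, "Googlebot", SITE + "/index.html")

print(directory_blocked, page_blocked, other_page_open, adsense_open, search_blocked)
```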
Note that the URL paths in directives are case-sensitive. For instance, Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp, but would allow http://www.example.com/Junk_file.asp. Googlebot will ignore white space (in particular, empty lines) and unknown directives in the robots.txt file.
Googlebot supports submission of Sitemap files through the robots.txt file.
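As a rough illustration of both notes, the sketch below checks the case-sensitivity example and reads back a Sitemap line with Python's urllib.robotparser. The sitemap URL is an assumed location chosen for the example, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The Sitemap line points at an assumed sitemap location.
NOTES_ROBOTS_TXT = """\
User-agent: *
Disallow: /junk_file.asp

Sitemap: http://www.example.com/sitemap.xml
"""

notes_parser = RobotFileParser()
notes_parser.parse(NOTES_ROBOTS_TXT.splitlines())

# Paths are compared case-sensitively, so only the lowercase URL is blocked.
lowercase_allowed = notes_parser.can_fetch("Googlebot", "http://www.example.com/junk_file.asp")
capitalized_allowed = notes_parser.can_fetch("Googlebot", "http://www.example.com/Junk_file.asp")

# Sitemap lines are independent of any User-agent section.
sitemaps = notes_parser.site_maps()

print("junk_file.asp allowed:", lowercase_allowed)
print("Junk_file.asp allowed:", capitalized_allowed)
print("Sitemaps:", sitemaps)
```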

Pattern matching (Wildcard and Dollar)

Googlebot (but not all search engines) respects some pattern matching.
  • To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:
    User-agent: Googlebot
    Disallow: /private*/
  • To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
    User-agent: Googlebot
    Disallow: /*?
  • To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
    User-agent: Googlebot
    Disallow: /*.xls$
    You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
    User-agent: *
    Allow: /*?$
    Disallow: /*?
    The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
    The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
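The wildcard and dollar rules above can be sketched in a few lines of Python by translating each pattern into a regular expression. This is an illustrative approximation, not Googlebot's actual code: the function names are hypothetical, and the tie-breaking between Allow and Disallow assumes the "longest matching pattern wins, Allow wins a tie" rule described in RFC 9309, which the text above does not spell out.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")  # a trailing $ pins the match to the end of the URL
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each * stand for any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def matches(pattern, path):
    # re.match anchors at the start, so a pattern acts as a prefix unless it ends in $.
    return pattern_to_regex(pattern).match(path) is not None

def is_allowed(rules, path):
    """rules is a list of (directive, pattern) pairs such as ("disallow", "/*?")."""
    hits = [(len(pattern), directive.lower() == "allow")
            for directive, pattern in rules if matches(pattern, path)]
    if not hits:
        return True  # no rule matches, so the URL may be crawled
    # Assumed precedence: longest pattern wins; on equal length, Allow wins.
    return max(hits)[1]

# The session-ID example from above.
SESSION_RULES = [("allow", "/*?$"), ("disallow", "/*?")]

print(matches("/private*/", "/private-stuff/page.html"))
print(matches("/*.xls$", "/files/report.xls"))
print(matches("/*.xls$", "/files/report.xlsx"))
print(is_allowed(SESSION_RULES, "/page?"))
print(is_allowed(SESSION_RULES, "/page?sid=123"))
```

With these rules, a URL that ends in a bare ? stays crawlable, while a URL that carries anything after the ? is blocked.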
Save your robots.txt file by downloading the file or copying the contents to a text file and saving it as robots.txt. The file must be named "robots.txt" and must reside in the root (the highest-level directory) of the domain. A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not.
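Since bots only look in the root of the domain, the robots.txt address for any page can be derived from the page URL alone. A small sketch with Python's urllib.parse, where the function name robots_txt_url and the sample page URL are hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the only location where bots look for robots.txt for this page's site."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop any query or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

location = robots_txt_url("http://www.example.com/mysite/page.html?id=7")
print(location)
```

Whatever subdirectory the page lives in, the result always points at the root of the domain.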
