Robots.txt is a file usually placed at the root of a website. It tells search engine robots which web pages and folders within the site they may or may not crawl, letting you selectively enable or disable crawling of different directories or pages.
Why do we need a Robots.txt
Generally we don’t strictly need it: search engines do index sites without a robots.txt file. From an SEO point of view, however, it can be one of the most important files on your server. There may be certain pages or folders on your website that you do not want search engines to index. The folders and web pages that most people want to exclude from crawling are typically:
- Search result pages within the website.
- Database-generated pages that duplicate content within the site.
- Folders kept for personal use or junk folders.
Search engine bots read robots.txt before crawling the website, so it is always advisable to keep a robots.txt file on the server. If you later change the file’s content, spiders and bots will pick up the change, which is good from an SEO point of view. Moreover, you can work with your website more comfortably when you have some control over how it is indexed.
It must be placed at the root of the website. Crawlers only request robots.txt from the root of the host, so a copy placed in an inner folder will simply be ignored.
Creating a Robots.txt file.
Creating a robots.txt file is probably the simplest thing a webmaster can do. Before we move on to tips and examples on how to create one, you should be familiar with the directives used in the file.
- User-agent: This names the specific robot that should follow the directions in the robots.txt file. If you write “User-agent: Googlebot”, only Googlebot will follow the rules in that section. To address all robots/spiders, use the wildcard syntax – User-agent: *
- Disallow: Disallow is used to block certain web pages/folders. It is followed by the path of the file or folder, relative to the site root. E.g. Disallow: /tools/junk.html forbids crawling of junk.html inside the tools folder.
- Allow: This is a newer directive used to allow crawling of certain web pages or folders within an otherwise disallowed directory. It lets you define exceptions that search engine bots may still crawl.
- Sitemap: This directive is used to tell robots about the location of sitemap of the website. E.g. Sitemap: http://www.sitename.com/sitemap.xml
- Crawl-delay: Used to limit the rate of a bot’s crawl. This is generally used when you believe robots are slowing your website down. For example, to direct Yahoo Slurp to crawl the website only once every 5 seconds, the following Crawl-delay technique is used:
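A minimal sketch of that Yahoo example (Slurp is Yahoo’s crawler user-agent):

```
User-agent: Slurp
Crawl-delay: 5
```

Note that Crawl-delay is a non-standard directive: Google ignores it, while Yahoo and Bing have historically honored it.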
Examples of Robots.txt file.
To allow everything on webserver (within the folder where robots.txt file is kept.)
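A minimal file that allows everything simply leaves the Disallow value empty:

```
User-agent: *
Disallow:
```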
To disallow everything on webserver (within the folder where robots.txt file is kept.)
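A single slash after Disallow blocks the entire server:

```
User-agent: *
Disallow: /
```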
To block a particular directory
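Assuming a hypothetical /tools/ directory, the trailing slash blocks the folder and everything under it:

```
User-agent: *
Disallow: /tools/
```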
To block a specific web page
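Reusing the junk.html example from above, give the full path to the page:

```
User-agent: *
Disallow: /tools/junk.html
```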
To block files of specific type
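Blocking by file type relies on wildcard syntax (* and $), an extension to the original standard that major engines such as Google and Bing support. For example, to block all PDF files:

```
User-agent: *
Disallow: /*.pdf$
```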
Robots.txt file points to be noted.
- A hash (#) at start indicates a comment.
- New user-agent section is introduced by a blank line in between.
- URL paths are case sensitive.
- Robots.txt is purely advisory: only well-behaved user-agents that adhere to the robots exclusion standard will obey it; badly behaved bots can ignore it entirely.
Most Common Robots.txt Mistake
The most common mistake people make with robots.txt is to add the forward slash “/” after “Disallow: ” accidentally, or without understanding what it does – completely blocking robots from crawling the website. It happens at times that by mistake we write something like this:
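That is, the catch-all block:

```
User-agent: *
Disallow: /
```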
Whenever this happens, you are directing search engine bots not to crawl or index the website at all. Your listings will typically start to come up like this:
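Roughly, the listing shows only the bare URL with a note in place of the snippet – something along these lines (the exact wording varies by search engine):

```
www.sitename.com/
A description for this result is not available because of this site's robots.txt
```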
So if your website stops ranking for its keywords, and typing in your domain shows just a bare URL with no description snippet, links, or similar pages as shown above, the first thing you should check is your robots.txt. MAKE SURE YOU WORK WITH “/” CORRECTLY.
I haven’t covered much of the wildcard syntax for robots.txt here. I will come back to it in a later post.