How To Verify You Have the Proper Robots.txt File

If you have a website or are involved in online marketing, you’ve likely heard about the robots.txt file. When implemented incorrectly, this file can have very negative and unintended consequences, such as blocked pages and resources. Imagine trying to rank for a keyword on a page that Google can’t access. Read this article to find out what a proper robots file looks like and to verify whether your website has one.

 

What Is It?

The robots file lives at the root of your site, for example http://www.website.com/robots.txt. It lets search engine crawlers know what parts of your website you do not want them to crawl. Well-behaved crawlers typically request it before crawling anything else on your site.
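Because the file always sits at the root of the host, you can work out its address from any page URL. The snippet below is a minimal sketch of that idea using Python's standard library; the function name robots_txt_url and the sample page URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL on the article's example domain.
print(robots_txt_url("http://www.website.com/blog/post?id=7"))
# prints http://www.website.com/robots.txt
```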

 

What Does It Look Like?

A robots.txt file has three main parts you should understand.

User-agent

[Image: robots.txt overview]

This line names the crawler that the commands below it apply to. Websites most commonly use * for the user-agent because it signifies “all user agents.”

The three main search engine user agents are:

  • “Googlebot” for Google
  • “Bingbot” for Bing
  • “Slurp” for Yahoo (the crawler is known as Yahoo! Slurp)

To give a command to a specific crawler, put that crawler's user-agent name on the User-agent line. Each crawler you name needs its own set of disallow commands.

For example, you would list Googlebot as the user-agent and then list the pages you do not want that crawler to crawl.

Most reputable crawlers, like those from Google, Yahoo, and Bing, will follow the directives in the robots.txt file. Spam crawlers (the ones that usually show up as junk traffic to your website) are less likely to follow them. Most of the time, using the * and giving the same command to all crawlers is the best route.
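To make the idea of per-crawler rule sets concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content and the /drafts/ and /internal-search/ paths are hypothetical, chosen only to show that a named crawler follows its own group while every other crawler falls back to the * group.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group for Googlebot, one for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /drafts/
Disallow: /internal-search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

site = "http://www.website.com"

# Googlebot matches its own group, which does not mention /internal-search/.
googlebot_search = parser.can_fetch("Googlebot", site + "/internal-search/")
# Bingbot has no group of its own, so the * group applies.
bingbot_search = parser.can_fetch("Bingbot", site + "/internal-search/")

print(googlebot_search, bingbot_search)  # prints True False
```

Notice that the Googlebot group has to repeat the /drafts/ rule. A crawler with its own group ignores the * group entirely in this parser, which is why each named crawler needs a complete set of commands.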

Disallow

[Image: robots.txt example with a disallow command]

This command lets crawlers know which files or pages on your website you do not want them to crawl. Typically, disallowed files are customer-sensitive pages (like checkout pages) or back-office pages with sensitive information.

Most problems in the robots.txt file occur within the disallow section. Issues arise when you block too much in the file. The example above shows an appropriate path to disallow: any URL whose path begins with /wp-admin/ will not be crawled.
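The prefix behavior of the /wp-admin/ rule can be checked with a short sketch using Python's standard urllib.robotparser. The two-line file is an assumed reconstruction of the rule described above, and the page URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed reconstruction of the rule discussed above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

site = "http://www.website.com"

# Any path that starts with /wp-admin/ is blocked...
admin_allowed = parser.can_fetch("Googlebot", site + "/wp-admin/options.php")
# ...while paths outside that prefix stay crawlable.
blog_allowed = parser.can_fetch("Googlebot", site + "/blog/")

print(admin_allowed, blog_allowed)  # prints False True
```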

[Image: robots.txt that blocks all crawling]

What is the above disallow command telling the crawlers? In this situation, the crawlers are told not to crawl any of the pages on your website. If you want your website visible in search engines, a single / in the disallow section is detrimental to that visibility.
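To see why a single slash is so damaging, here is a minimal sketch with Python's standard urllib.robotparser. The file content is an assumed reconstruction of the blocking rule described above, and the URLs are made up.

```python
from urllib.robotparser import RobotFileParser

# Assumed file: a single "/" after Disallow matches every path on the site.
BLOCK_EVERYTHING = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BLOCK_EVERYTHING.splitlines())

# Hypothetical URLs, including a stylesheet a page would need to render.
urls = [
    "http://www.website.com/",
    "http://www.website.com/products/",
    "http://www.website.com/assets/style.css",
]

crawlable = [url for url in urls if parser.can_fetch("Googlebot", url)]
print(crawlable)  # prints [] because every path begins with "/"
```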

Google even sends out Search Console messages letting websites know if the robots.txt file blocks resources it needs to crawl, like CSS and JavaScript files.

[Image: Search Console robots.txt test]

If you want to err on the safe side, let all crawlers crawl every page on your website. You can do this by not disallowing anything.

[Image: robots.txt that does not block crawling]

As you can see, nothing follows the disallow command. When a search crawler sees this, it will go ahead and crawl all pages it finds on your website.
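The empty disallow command can be checked the same way. This is a minimal sketch with Python's standard urllib.robotparser; the file content is an assumed reconstruction of the non-blocking example, and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed file: Disallow with no value blocks nothing.
ALLOW_EVERYTHING = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ALLOW_EVERYTHING.splitlines())

# Hypothetical URLs on the example domain.
urls = [
    "http://www.website.com/",
    "http://www.website.com/wp-admin/",
    "http://www.website.com/assets/style.css",
]

# Every URL passes because the empty rule is treated as "allow all".
all_crawlable = all(parser.can_fetch("Googlebot", url) for url in urls)
print(all_crawlable)  # prints True
```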

Sitemap

[Image: robots.txt that does not block crawling]

A robots.txt file can also include the location of your website’s sitemap, which I highly recommend adding. If you list a sitemap, crawlers can use it to discover your pages soon after reading your robots.txt file. Make sure the sitemap lists your webpages, specifically the ones you are trying to market.
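As a rough illustration, the sketch below shows a file with a Sitemap line and reads it back with Python's standard urllib.robotparser (the site_maps() method needs Python 3.8 or later). The sitemap URL is an assumption for the example; use wherever your own sitemap actually lives.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: nothing is blocked, and a sitemap location is listed.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://www.website.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)  # prints ['http://www.website.com/sitemap.xml']
```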

 

How Do I Verify that I Set Up My Robots.txt File Correctly?

Within Search Console, you’ll find a tool to test your robots.txt file.

[Image: robots.txt tester in Search Console]

If there are any problems or errors with your robots.txt file, Search Console will let you know. You can even input specific URLs to check whether Google is allowed to crawl them.
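For a quick local check of specific URLs, a similar test can be sketched with Python's standard urllib.robotparser. This is only an approximation: Python's parser uses simple prefix matching and does not implement Google's * and $ wildcards, so treat Search Console as the authority. The helper name blocked_urls, the /checkout/ rule, and the URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return the URLs that robots_txt forbids user_agent from crawling."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical file that keeps crawlers out of checkout pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
"""

pages_to_check = [
    "http://www.website.com/pricing/",
    "http://www.website.com/checkout/cart",
]

blocked = blocked_urls(ROBOTS_TXT, "Googlebot", pages_to_check)
print(blocked)  # prints ['http://www.website.com/checkout/cart']
```

If a page you are trying to rank shows up in the blocked list, that is the disallow command to revisit.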

Remember that a search engine’s role is to a) crawl, b) index, and c) provide results.

The robots.txt file controls crawling, not indexing. For example, if another page links to a URL that your robots.txt file blocks, Google can discover that link and index the URL it points to without ever crawling the page. Any time that Google indexes a page, it could show up in a search result.

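If you don’t want a webpage to show up in a search result, include that information on the page itself. Add the code <meta name="robots" content="noindex"> to the specific page you don’t want search engines to index. Note that the page must not be blocked in robots.txt, because a crawler has to fetch the page in order to see the tag.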
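To confirm that a page actually carries the noindex tag, you could scan its HTML. The sketch below uses Python's standard html.parser; the class name RobotsMetaParser, the helper has_noindex, and the sample page are hypothetical, and the check only looks at the generic "robots" meta tag, not crawler-specific tags or HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # The content value may hold several comma-separated directives.
            self.directives += [
                part.strip().lower() for part in content.split(",") if part.strip()
            ]


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page that should stay out of search results.
page = '<html><head><meta name="robots" content="noindex"></head><body>Thanks!</body></html>'
print(has_noindex(page))  # prints True
```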

The robots.txt file is certainly a more technical aspect of SEO, and it can get confusing. While this file can be tricky, simply understanding how a robots.txt file works will help you verify that your website is as visible as possible.

Colton Miller

Colton is the Director of SEO Strategy at Boostability, testing and defining the products and processes that make Boostability's customers successful. He has been a part of Boostability for over 7 years. Colton loves hanging out with his family and gaming. He runs a personal blog over at www.coltonjmiller.com where he discusses gaming, life, and SEO.